PII Metadata for ECS Field (and Vendor Fields) #2491

@Mikaayenson

Summary

This proposal suggests enhancing Elastic Common Schema (ECS) by adding explicit metadata to field definitions, allowing fields to be marked as containing Personally Identifiable Information (PII). This metadata would facilitate automated data governance, filtering, and compliance efforts downstream.

Background

Currently, identifying and managing PII within logs often relies on manually curated lists of sensitive fields. This approach is error-prone, difficult to scale, and requires constant maintenance as the ECS schema evolves or as new integrations introduce new fields.

Standardizing PII metadata directly within the ECS field definitions would offer several benefits:

  • Automated Data Governance: Tools and processes can programmatically identify and handle PII fields, enabling automated redaction, anonymization, or access controls.
  • Improved Compliance: Provides a clearer, auditable mechanism for demonstrating compliance with data privacy regulations (e.g., GDPR, CCPA) by identifying sensitive data at its source.
  • Enhanced Data Understanding: Data producers and consumers gain a standardized understanding of which fields inherently carry PII risk, leading to better data hygiene from ingestion to analysis.
  • Reduced Manual Effort: Eliminates the need for maintaining separate, external PII field lists, reducing operational overhead and the risk of human error.
  • Consistency Across Ecosystem: Extends a consistent PII identification strategy across ECS fields and integrated vendor fields.
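
To make the first benefit concrete, here is a minimal sketch of automated redaction. It assumes the set of PII field names has already been derived from the schema metadata; the `PII_FIELDS` set, the `redact` helper, and the `[REDACTED]` mask are all hypothetical names used for illustration, not part of ECS or any Elastic tooling.

```python
import copy

# Hypothetical: dotted field names that the schema flags with `pii: true`.
PII_FIELDS = {"user.name", "user.email", "source.ip"}

def redact(event, pii_fields, mask="[REDACTED]"):
    """Return a copy of a nested event with every present PII field masked."""
    redacted = copy.deepcopy(event)
    for dotted in pii_fields:
        node = redacted
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # parent object missing; nothing to redact
        else:
            if isinstance(node, dict) and leaf in node:
                node[leaf] = mask
    return redacted

event = {
    "user": {"name": "jdoe", "id": "42"},
    "source": {"ip": "203.0.113.7"},
    "event": {"action": "login"},
}
out = redact(event, PII_FIELDS)
```

The original event is left untouched, and fields that are absent from the document (here `user.email`) are simply skipped.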

Detailed Design

The proposed solution involves adding new optional metadata fields directly to the ECS field definitions. This metadata would signal that a given field is expected to contain PII.

Proposed Field Names

  • pii: A boolean field indicating whether the field is considered PII.
  • pii_category (Optional): A keyword field to classify the type of PII (e.g., "Name/Identifier", "Email Address", "IP Address", "Free-text/Context-dependent"). This provides more granular context for handling.
  • pii_reason (Optional): A text field to briefly explain why the field is classified as PII.
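
A schema linter could check these three attributes with something like the sketch below. The `validate_pii_metadata` function is hypothetical, and the rule that `pii_category` and `pii_reason` only make sense alongside `pii: true` is an assumption of this sketch rather than something the proposal states.

```python
def validate_pii_metadata(field):
    """Return a list of problems with one field definition's PII metadata."""
    problems = []
    pii = field.get("pii")
    # `pii` is optional, but when present it must be a real boolean.
    if pii is not None and not isinstance(pii, bool):
        problems.append("pii must be a boolean")
    for key in ("pii_category", "pii_reason"):
        if key in field:
            if not isinstance(field[key], str) or not field[key].strip():
                problems.append(f"{key} must be a non-empty string")
            # Assumption: the optional attributes require pii: true.
            if pii is not True:
                problems.append(f"{key} is set but pii is not true")
    return problems

ok = validate_pii_metadata({
    "name": "user.name",
    "type": "keyword",
    "pii": True,
    "pii_category": "Name/Identifier",
})
bad = validate_pii_metadata({"name": "example.field", "pii": "yes"})
```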

Example Values for the Fields:

Below are examples showing how the proposed pii metadata would be integrated into existing ECS field definitions:

# Example 1: User Name
- name: user.name
  type: keyword
  description: "Name of the user."
  pii: true
  pii_category: "Name/Identifier"
  pii_reason: "Usernames can directly identify individuals."

# Example 2: User Email Address
- name: user.email
  type: keyword
  description: "Email address of the user."
  pii: true
  pii_category: "Email Address"
  pii_reason: "Personal email addresses directly identify individuals."

# Example 3: Source IP Address
- name: source.ip
  type: ip
  description: "IP address of the source."
  pii: true
  pii_category: "IP Address"
  pii_reason: "Public IP addresses can sometimes be linked to individuals, especially in home or small business contexts."

# Example 4: Hostname (potentially PII)
- name: host.hostname
  type: keyword
  description: "Hostname of the host."
  pii: true
  pii_category: "Hostname/Device Name"
  pii_reason: "Hostnames or device names, especially for personal devices, may contain personal identifying information."

# Example 5: File Path (context-dependent PII)
- name: file.path
  type: keyword
  description: "Full path to the file."
  pii: true
  pii_category: "File Path"
  pii_reason: "File paths often contain usernames or other identifying information (e.g., `/Users/johndoe/Documents/`)."

# Example 6: Process Command Line (free-text PII)
- name: process.command_line
  type: keyword
  description: "The full command line that was used to start a process, including the process name and all arguments."
  pii: true
  pii_category: "Free-text/Context-dependent"
  pii_reason: "Command-line arguments can contain arbitrary sensitive data, including credentials, filenames, or user-specific information."

# Example 7: Email Sender Address
- name: email.sender.address
  type: keyword
  description: "Email address of the sender."
  pii: true
  pii_category: "Email Address"
  pii_reason: "Sender's email address directly identifies an individual."

Alternatively, this metadata could reside in a single top-level list of PII fields instead of being repeated inline on every field definition.
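
The two layouts carry the same information, so one can be derived from the other. As a rough sketch, the function below splits inline definitions into plain field definitions plus a standalone PII list. The entry keys `field`, `category`, and `reason` are hypothetical, since the proposal does not define the shape of such a list.

```python
# Inline definitions as they might look after parsing the YAML (sample data).
inline_fields = [
    {"name": "user.name", "type": "keyword", "pii": True,
     "pii_category": "Name/Identifier",
     "pii_reason": "Usernames can directly identify individuals."},
    {"name": "source.ip", "type": "ip", "pii": True,
     "pii_category": "IP Address"},
    {"name": "event.action", "type": "keyword"},
]

PII_KEYS = ("pii", "pii_category", "pii_reason")

def split_pii_metadata(fields):
    """Separate inline PII metadata into a standalone top-level list."""
    plain_fields, pii_list = [], []
    for field in fields:
        # Keep the field definition itself free of PII attributes.
        plain_fields.append({k: v for k, v in field.items() if k not in PII_KEYS})
        if field.get("pii") is True:
            entry = {"field": field["name"]}
            if "pii_category" in field:
                entry["category"] = field["pii_category"]
            if "pii_reason" in field:
                entry["reason"] = field["pii_reason"]
            pii_list.append(entry)
    return plain_fields, pii_list

plain_fields, pii_list = split_pii_metadata(inline_fields)
```

A top-level list is easier for downstream tools to consume in one read, while inline metadata keeps each classification next to the field it describes.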

Extension to Vendor Fields (Integration Fields)

This PII metadata standard should explicitly extend to fields defined within Elastic Integrations (e.g., in packages//data_stream//fields/fields.yml). When integrations define or map fields, they should be able to apply the pii, pii_category, and pii_reason metadata to their respective field definitions.

For instance, considering the AWS CloudTrail integration fields:

# Example: AWS CloudTrail fields with proposed PII metadata
- name: aws.cloudtrail
  type: group
  fields:
    # ... other fields
    - name: user_identity
      type: group
      fields:
        - name: type
          type: keyword
          description: The type of the identity
        - name: arn
          type: keyword
          description: The Amazon Resource Name (ARN) of the principal that made the call.
          pii: true
          pii_category: "Identifier"
          pii_reason: "ARNs can uniquely identify cloud identities, which may be tied to individuals or roles."
        - name: principal_id
          type: keyword
          description: The internal ID of the entity that was used to get credentials.
          pii: true
          pii_category: "Identifier"
          pii_reason: "Principal IDs are unique identifiers that can be linked to individuals or roles."
        # ... other fields in user_identity
    # ... other fields in aws.cloudtrail

# Example: related.entity, target.entity.id, and actor.entity.id
- name: related.entity
  description: "A collection of all entity identifiers associated with the document."
  type: keyword
  pii: true
  pii_category: "Identifier"
  pii_reason: "This field aggregates various entity identifiers (e.g., cloud resource IDs, ARNs, email addresses) which can be PII."

- name: target
  type: group
  fields:
    - name: entity
      type: group
      fields:
        - name: id
          type: keyword
          pii: true
          pii_category: "Identifier"
          pii_reason: "Target entity IDs can be linked to specific users or resources that contain PII."

- name: actor
  type: group
  fields:
    - name: entity
      type: group
      fields:
        - name: id
          type: keyword
          pii: true
          pii_category: "Identifier"
          pii_reason: "Actor entity IDs represent the initiating entity, often a user, and therefore contain PII."
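
Because integration fields are nested under `group` entries, a consumer would need to walk the tree to recover full dotted names. The sketch below shows one way to do that on already-parsed definitions; `iter_pii_fields` is a hypothetical helper, and the sample data is a trimmed copy of the CloudTrail example above.

```python
def iter_pii_fields(fields, prefix=""):
    """Yield dotted names of fields flagged pii: true, descending into groups."""
    for field in fields:
        name = f"{prefix}{field['name']}"
        if field.get("type") == "group":
            # Groups carry no data themselves; recurse into their children.
            yield from iter_pii_fields(field.get("fields", []), prefix=f"{name}.")
        elif field.get("pii") is True:
            yield name

cloudtrail = [
    {"name": "aws.cloudtrail", "type": "group", "fields": [
        {"name": "user_identity", "type": "group", "fields": [
            {"name": "type", "type": "keyword"},
            {"name": "arn", "type": "keyword", "pii": True,
             "pii_category": "Identifier"},
            {"name": "principal_id", "type": "keyword", "pii": True,
             "pii_category": "Identifier"},
        ]},
    ]},
]

pii_names = list(iter_pii_fields(cloudtrail))
```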

It's also important to consider "free-text" or "flattened" fields (like aws.cloudtrail.request_parameters, response_elements, additional_eventdata, service_event_details), where PII might be present within unstructured JSON. Granular PII tagging isn't feasible at the schema level for dynamic content, so marking the top-level flattened field as pii: true can only serve as a general warning. Explicit guidance or tooling for inspecting the contents of such fields for PII at ingest time would therefore be beneficial. This recommendation primarily focuses on explicitly defined fields.
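
As an illustration of what such content inspection might involve, the sketch below walks a parsed JSON value and reports the paths of strings that look like email addresses. It is deliberately narrow: the `find_email_paths` helper is hypothetical, the regular expression is a simple approximation rather than a full address validator, and real tooling would need to cover many more PII types.

```python
import re

# Rough approximation of an email address; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_email_paths(value, path=""):
    """Return paths inside a nested JSON-like value whose strings look like emails."""
    hits = []
    if isinstance(value, dict):
        for key, child in value.items():
            hits += find_email_paths(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            hits += find_email_paths(child, f"{path}[{i}]")
    elif isinstance(value, str) and EMAIL_RE.search(value):
        hits.append(path)
    return hits

# Made-up contents for a flattened field.
request_parameters = {
    "userName": "alice@example.com",
    "tags": [{"value": "ops"}, {"value": "owner=bob@example.org"}],
    "maxItems": 10,
}
paths = find_email_paths(request_parameters, "aws.cloudtrail.request_parameters")
```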

Relevant Existing Issues

This proposal builds upon the general concept of enriching field metadata, as discussed in elastic/ecs#68, which, if I'm not mistaken, explores adding a type property to describe the nature of fields. While that issue is broader in scope, the introduction of pii metadata aligns with the idea of providing more descriptive and actionable information directly within the ECS field definitions.
