
Extend Chat Template Tokenization for Training/Finetuning #27609

@siddk

Description


Feature request

Extend tokenizer.apply_chat_template with functionality for training/finetuning, returning attention_masks and (optional) labels (for ignoring "System" and "User" messages during loss computation).

I think this requires the following steps:

  • Adding support for taking in a batch of conversations (e.g., List[Conversation], where Conversation := List[Dict[str, str]]).
  • Invoking the native tokenizer.__call__() after applying the template to each example (passing through padding, truncation, and any other parameters).
  • Important: Adding an optional output for labels -- a "masked" version of the returned input_ids with tokens corresponding to the System/User roles set to be ignored for loss computation (e.g., set to IGNORE_INDEX = -100). A rough sketch of this workflow follows this list.
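
Here is a minimal proof-of-concept sketch of the kind of helper described above. The function name `apply_chat_template_for_training`, the `IGNORE_INDEX` constant, and the prefix-diff trick are illustrative assumptions, not an existing transformers API; the sketch also assumes the chat template renders the first i messages as a strict prefix of the rendering of the first i+1 messages, and a right-padding tokenizer.

```python
import torch

IGNORE_INDEX = -100  # the default ignore_index of torch.nn.CrossEntropyLoss


def apply_chat_template_for_training(tokenizer, conversations, max_length=2048):
    """Tokenize a batch of conversations and build loss-masked labels.

    `conversations` is a List[Conversation], where Conversation := List[Dict[str, str]]
    with "role"/"content" keys. Tokens from non-assistant turns are set to
    IGNORE_INDEX in `labels` so they are skipped during loss computation.
    """
    all_input_ids, all_labels = [], []
    for conversation in conversations:
        input_ids, labels = [], []
        for i, message in enumerate(conversation):
            # Render the template for the first i+1 messages and keep only the
            # newly added tokens (assumes each rendering is a prefix of the next).
            prefix_ids = tokenizer.apply_chat_template(conversation[: i + 1], tokenize=True)
            new_ids = prefix_ids[len(input_ids):]
            input_ids.extend(new_ids)
            if message["role"] == "assistant":
                labels.extend(new_ids)                        # keep assistant tokens
            else:
                labels.extend([IGNORE_INDEX] * len(new_ids))  # mask System/User tokens
        all_input_ids.append(input_ids[:max_length])
        all_labels.append(labels[:max_length])

    # Delegate padding to the tokenizer so attention_mask is built consistently.
    # Assumes tokenizer.padding_side == "right" so the label padding lines up.
    batch = tokenizer.pad({"input_ids": all_input_ids}, padding=True, return_tensors="pt")
    padded_len = batch["input_ids"].shape[1]
    batch["labels"] = torch.tensor(
        [lab + [IGNORE_INDEX] * (padded_len - len(lab)) for lab in all_labels]
    )
    return batch  # input_ids, attention_mask, labels
```

A real implementation would likely live inside the tokenizer itself (so it can reuse truncation/padding arguments directly) rather than re-rendering the template per message, but the masking logic would be the same.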

Motivation

The new tokenizer.apply_chat_template feature is great, and resolves a lot of ambiguity when it comes to formatting inputs for chat-based LLMs.

However, right now it's geared for inference-time usage, only taking a single "conversation" and outputting the input_ids (tokens) after applying the chat template.

When finetuning models on chat-based data, it would be really nice to unify the apply_chat_template API with the tokenizer.__call__() API, returning attention_masks and (optionally) labels (with "System" and "User" role text automatically ignored for loss computation).
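
For concreteness, here is the two-step workflow one has to use today, which the proposal would fold into a single call (the model name is just an example; any model with a chat template works):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# Step 1: apply_chat_template returns only the formatted text (or bare input_ids)...
text = tokenizer.apply_chat_template(conversation, tokenize=False)

# Step 2: ...so getting attention_mask / padding / truncation requires a separate
# tokenizer call, and labels still have to be built by hand.
encoded = tokenizer(text, add_special_tokens=False, return_tensors="pt")
print(encoded.keys())  # input_ids, attention_mask -- but no labels
```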

Your contribution

I can try building a proof-of-concept for a "standard" workflow and opening a draft PR; I think there'd need to be a few discussions about the actual implementation details, though!
