Feature request
Extend `tokenizer.apply_chat_template` with functionality for training/finetuning, returning `attention_mask` and (optional) `labels` (for ignoring "System" and "User" messages during loss computation).
I think this requires the following steps:
- Adding support for taking in a batch of conversations (e.g., `List[Conversation := List[Dict[str, str]]]`).
- Invoking the native `tokenizer.__call__()` after applying the template to each example (passing through padding, truncation, and any other parameters).
- Important: adding an optional output for `labels` -- a "masked" version of the returned `input_ids`, with tokens corresponding to the System/User roles set to be ignored during loss computation (e.g., set to `IGNORE_INDEX = -100`).
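To make the last step concrete, here's a minimal sketch of the label-masking logic, independent of any real tokenizer. The `build_inputs_and_labels` helper and the toy per-message token ids are illustrative assumptions, not the actual transformers API:

```python
# Hypothetical sketch of the proposed label-masking step. The helper name and
# the pre-tokenized toy messages below are assumptions for illustration only.
IGNORE_INDEX = -100  # PyTorch's default ignore_index for CrossEntropyLoss


def build_inputs_and_labels(tokenized_messages):
    """Concatenate per-message token ids into a single input_ids list,
    masking system/user tokens in the labels with IGNORE_INDEX."""
    input_ids, labels = [], []
    for role, ids in tokenized_messages:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)  # compute loss on assistant tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # ignore system/user
    return input_ids, labels


# Toy example: pretend each message was already templated and tokenized.
conversation = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
]
input_ids, labels = build_inputs_and_labels(conversation)
# input_ids == [1, 2, 3, 4, 5, 6, 7]
# labels    == [-100, -100, -100, -100, -100, 6, 7]
```

A real implementation would need the template to expose per-message token boundaries, which is one of the implementation details worth discussing.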
Motivation
The new `tokenizer.apply_chat_template` feature is great, and resolves a lot of ambiguity when it comes to formatting inputs for chat-based LLMs. However, right now it's geared toward inference-time usage, only taking a single "conversation" and outputting the `input_ids` (tokens) after applying the chat template.
When finetuning models on chat-based data, it would be really nice to unify the `apply_chat_template` API with the `tokenizer.__call__()` API, returning `attention_mask` and (optionally) `labels` (with "System" and "User" role text automatically ignored during loss computation).
Your contribution
I can try building a proof-of-concept for a "standard" workflow and Draft PR; I think there'd need to be a few discussions about the actual implementation details though!