PretrainedTokenizer::truncateHelper: prevent array_slice() error for flawed text input (summarization) #36

k00ni · 2024-05-16T15:09:14Z

What:

Bug Fix

Description:

When running a summarization task, such as

$summarizer = pipeline('summarization', 'Xenova/distilbart-cnn-6-6');
$summary = $summarizer($text, maxNewTokens: 512, temperature: 0.7);

certain input may result in the following error in PretrainedTokenizer::truncateHelper:

array_slice(): Argument #1 ($array) must be of type array, null given

My tests suggest that texts containing lines with (almost?) no text cause this problem (Example). It's not model-dependent.
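The kind of guard described here can be sketched as follows. This is a minimal illustration, not the library's actual code: the function name `truncateEntries`, the shape of `$encoded`, and the assumption that the offending entry is `null` are all hypothetical, inferred only from the error message above. It assumes PHP 8.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for a truncation helper (not the library's code).
 * Truncates every array-valued entry of $item to at most $length elements
 * and leaves non-array entries (such as null) untouched instead of passing
 * them to array_slice().
 */
function truncateEntries(array $item, int $length): array
{
    foreach ($item as $key => $value) {
        // The guard: only slice values that really are arrays.
        if (is_array($value)) {
            $item[$key] = array_slice($value, 0, $length);
        }
    }

    return $item;
}

// Assumed example data: one entry is null, as in the reported error.
$encoded = [
    'input_ids'      => [0, 713, 16, 10, 1296, 2],
    'attention_mask' => [1, 1, 1, 1, 1, 1],
    'token_type_ids' => null,
];

$truncated = truncateEntries($encoded, 4);

echo count($truncated['input_ids']), PHP_EOL;
```

Skipping non-array entries keeps the helper tolerant of partially filled encodings; whether a missing entry should instead be reported upstream is a separate design question.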

Two tests were added to prove that the fix is working:

SummarizationPipelineTest: Integration test which checks behavior using a real model and some extracted text from a PDF (original source: https://arxiv.org/abs/2309.06888). I think there is a better way to accomplish the same test result, because the test takes 10+ seconds to run.
PretrainedTokenizerTest: Unit test to check PretrainedTokenizer::truncateHelper itself. The input is flawed by design, which would trigger the error without the fix.
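To illustrate what "flawed by design" input means for the unit test, the sketch below reproduces the reported error in plain PHP 8, without the library or a test framework. The helper `truncateUnguarded` and the `$flawed` array are hypothetical and only mimic an unguarded `array_slice()` call; the real test in the PR may look different.

```php
<?php
declare(strict_types=1);

// Hypothetical unguarded helper, used only to reproduce the reported error.
function truncateUnguarded(array $item, int $length): array
{
    foreach ($item as $key => $value) {
        // In PHP 8 this throws a TypeError when $value is null.
        $item[$key] = array_slice($value, 0, $length);
    }

    return $item;
}

// Input that is flawed by design: one entry is null instead of an array.
$flawed = [
    'input_ids'      => [1, 2, 3, 4, 5],
    'token_type_ids' => null,
];

$message = null;
try {
    truncateUnguarded($flawed, 3);
} catch (TypeError $e) {
    $message = $e->getMessage();
}

echo $message ?? 'no error', PHP_EOL;
```

A unit test built on this idea would feed the same kind of input to the fixed method and assert that no `TypeError` is thrown.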

I marked the PR as draft because I would like to hear your opinion first.

with the if-clause in PretrainedTokenizer::truncateHelper certain input may result in the following error:

array_slice(): Argument #1 ($array) must be of type array, null given

two tests were added to prove that the fix is working:

1. SummarizationPipelineTest: Integration test which checks behavior using a real model and some extracted text from a PDF. I think there is a better way to accomplish the same test result, because this one test runs 10+ sec. locally.
2. PretrainedTokenizerTest: Unit test to check PretrainedTokenizer::truncateHelper itself. The input is flawed by design, which would trigger the error without the fix.

CodeWithKyrian · 2024-05-20T17:08:38Z

Thank you so much for your contribution, @k00ni. I'm happy with the changes for the truncateHelper. However, I'm a bit undecided about the tests. While I appreciate the initiative to include tests with changes like this, I'm reluctant to accept them for now.

If you notice, there aren't many tests in the library, which I'm not proud of. This is because I want to take my time to decide on the best structure for the tests. Classes like Tensor and Image can be easily tested, and I included tests for the basic tokenizer because the config file sizes are relatively small. However, the overall testing structure of the library is still largely undecided.

For now, let's leave out the tests. Since you seem to have a keen interest in testing, I would really appreciate your suggestions on how best to structure tests for the project.

k00ni · 2024-05-22T06:55:47Z

I will reduce the PR to the fix and will write my feedback about the tests in a separate issue.

CodeWithKyrian · 2024-05-22T06:59:16Z

That's great. Looking forward to your suggestions

removed tests for now due to feedback of the author

b70d108

k00ni marked this pull request as ready for review May 22, 2024 06:57

CodeWithKyrian merged commit 0618360 into CodeWithKyrian:main May 22, 2024

k00ni deleted the fix/array-slice-truncate-helper branch May 22, 2024 07:02

k00ni mentioned this pull request May 22, 2024

Establishing the test infrastructure #38

Open

martindewawd mentioned this pull request Mar 25, 2025

Matlib version error #88

Open

4 tasks
