Skip to content

PretrainedTokenizer::truncateHelper: prevent array_slice() error for flawed text input (summarization) #36

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

k00ni
Copy link
Contributor

@k00ni k00ni commented May 16, 2024

What:

  • Bug Fix

Description:

When running a summarization task, such as

$summarizer = pipeline('summarization', 'Xenova/distilbart-cnn-6-6');
$summary = $summarizer($text, maxNewTokens: 512, temperature: 0.7);

certain input may result in the following error in PretrainedTokenizer::truncateHelper:

array_slice(): Argument #1 ($array) must be of type array, null given

Bildschirmfoto vom 2024-05-16 17-10-12

My tests suggest that texts with lines, containing (almost?) no text, cause this problem (Example). Its not model-dependent.

Two tests were added to prove that the fix is working:

  1. SummarizationPipelineTest: Integration test which checks behavior using a real model and some extracted text from a PDF (original source: https://arxiv.org/abs/2309.06888). I think there is a better way to accomplish the same test result, because the test runs 10+ seconds

  2. PretrainedTokenizerTest: Unit test to check PretrainedTokenizer::truncateHelper itself. The input is flawed by design, which would trigger the error without the fix.


I marked the PR as draft because I would like to hear your opinion first.

with the if-clause in PretrainedTokenizer::truncateHelper
certain input may result in the following error:

    array_slice(): Argument CodeWithKyrian#1 ($array) must be of type array,
    null given

two tests were added to prove that the fix is working:

1. SummarizationPipelineTest:
Integration test which checks behavior using a real model
and some extracted text from a PDF. I think there is a better
way to accomplish the same test result, because this one test
runs 10+ sec. locally.

2. PretrainedTokenizerTest:
Unit test to check PretrainedTokenizer::truncateHelper itself.
The input is flawed by design, which would trigger the error
without the fix.
@CodeWithKyrian
Copy link
Owner

Thank you so much for your contribution, @k00ni. I'm happy with the changes for the truncateHelper. However, I'm a bit undecided about the tests. While I appreciate the initiative to include tests with changes like this, I'm reluctant to accept them for now.

If you notice, there aren't many tests in the library, which I'm not proud of. This is because I want to take my time to decide on the best structure for the tests. Classes like Tensor and Image can be easily tested, and I included tests for the basic tokenizer because the config file sizes are relatively small. However, the overall testing structure of the library is still largely undecided.

For now, let's leave out the tests. Since you seem to have a keen interest in testing, I would really appreciate your suggestions on how best to structure tests for the project.

@k00ni
Copy link
Contributor Author

k00ni commented May 22, 2024

I will reduce the PR to the fix and will write my feedback about the tests in a separate issue.

@k00ni k00ni marked this pull request as ready for review May 22, 2024 06:57
@CodeWithKyrian
Copy link
Owner

That's great. Looking forward to your suggestions

@CodeWithKyrian CodeWithKyrian merged commit 0618360 into CodeWithKyrian:main May 22, 2024
@k00ni k00ni deleted the fix/array-slice-truncate-helper branch May 22, 2024 07:02
@martindewawd martindewawd mentioned this pull request Mar 25, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants