fix: digits pre-tokenizer returning empty array for text with no digits #51

Merged 1 commit on Jul 23, 2024
6 changes: 4 additions & 2 deletions docs/getting-started.md
@@ -66,11 +66,13 @@ Arguments:

can specify it here. This downloads any additional configuration or data needed for that task.
- `[options]` (optional): Additional options to customize the download process.
- `-cache_dir=<directory>`: Choose where to save the models. If you've got a preferred storage spot, mention it
- `--cache-dir=<directory>`: Choose where to save the models. If you've got a preferred storage spot, mention it
here. Otherwise, it goes to the default cache location. You can use the shorthand `-c` instead of `--cache-dir`.
- `--quantized=<true|false>`: Decide whether you want the quantized version of the model, which is smaller and
faster. The default is true, but if for some reason you prefer the full version, you can set this to false. You
can use the shorthand `-q` instead of `--quantized`. Example: `--quantized=false`, `-q false`.
- `--model-filename=<filename>`: Specify the exact model filename to download (without the `.onnx` suffix), e.g.
"model" or "model_quantized".

The `download` command will download the model weights and save them to the cache directory. The next time you use the
model, TransformersPHP will use the cached weights instead of downloading them again.
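
For illustration, a hypothetical invocation combining these options might look like the following. The model name and the `./vendor/bin/transformers` path are assumptions for the sketch, not taken from this diff:

```shell
# Download full-precision weights for a model into a local cache directory.
# "Xenova/bert-base-uncased" is a placeholder model identifier.
./vendor/bin/transformers download Xenova/bert-base-uncased \
    --cache-dir=./models \
    --quantized=false \
    --model-filename=model
```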
@@ -199,7 +201,7 @@

OpenMP is a set of compiler directives and library routines that enable parallel
programs. TransformersPHP uses OpenMP to enable multithreaded operations in the Tensors, which can improve performance
on multi-core systems. OpenMP is not required, but it can provide a significant performance boost for some operations.
Check out the [OpenMP website](https://www.openmp.org/) for more information on how to install and configure OpenMP on
your system.

Example: On Ubuntu, you can install OpenMP using the following command:
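
The install command itself is cut off by the diff view; as an assumption (not shown on this page), a typical Ubuntu setup looks like:

```shell
# GCC ships with the GNU OpenMP runtime (libgomp) by default; this
# installs the LLVM OpenMP runtime and headers.
sudo apt-get install libomp-dev
```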

10 changes: 9 additions & 1 deletion src/Commands/DownloadModelCommand.php
@@ -46,12 +46,20 @@ protected function configure(): void

$this->addOption(
'quantized',
null,
'q',
InputOption::VALUE_OPTIONAL,
'Whether to download the quantized version of the model.',
true
);

$this->addOption(
'model-filename',
null,
InputOption::VALUE_OPTIONAL,
'The filename of the exact model weights version to download.',
null
);

}

protected function execute(InputInterface $input, OutputInterface $output): int
7 changes: 6 additions & 1 deletion src/PreTokenizers/DigitsPreTokenizer.php
@@ -9,16 +9,21 @@ class DigitsPreTokenizer extends PreTokenizer
{

protected string $pattern;

public function __construct(protected array $config)
{
$individualDigits = $this->config['individual_digits'] ? '' : '+';

$digitPattern = "[^\\d]+|\\d$individualDigits";

$this->pattern = "/$digitPattern/u";

}

public function preTokenizeText(string|array $text, array $options): array
{
return preg_split($this->pattern, $text, -1, PREG_SPLIT_NO_EMPTY) ?? [];
preg_match_all($this->pattern, $text, $matches);

return $matches[0] ?? [];
}
}
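
As a standalone illustration of why the change matters (not part of the diff), the pattern below mirrors the pre-tokenizer's with `individual_digits` disabled. `preg_split` treats every match of the pattern as a delimiter, and since the pattern matches both digit and non-digit runs, the whole input is consumed as delimiters and nothing remains between them; `preg_match_all` returns the matched runs themselves:

```php
<?php
// Pattern equivalent to the pre-tokenizer's, with individual_digits = false:
// match runs of non-digits or runs of digits.
$pattern = '/[^\d]+|\d+/u';

$text = 'hello world';

// preg_split: the entire string matches the pattern, so it is consumed as a
// delimiter; only empty pieces remain, and PREG_SPLIT_NO_EMPTY discards
// them -- hence the empty array the issue title describes.
var_dump(preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY)); // []

// preg_match_all: return the matched runs themselves.
preg_match_all($pattern, $text, $matches);
var_dump($matches[0]); // ['hello world']

// With digits present, both digit and non-digit runs are kept as tokens:
preg_match_all($pattern, 'abc123', $m);
var_dump($m[0]); // ['abc', '123']
```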