Tokenizer not picking the right tokens ( mistral openorca ) #3475

Closed

staviq

Tested with 019ba1d

Model https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/tree/main converted and quantized to q8_0 from scratch.

Mistral OpenOrca defines the special tokens <|im_start|> and <|im_end|>.

From the point of view of https://github.com/ggerganov/llama.cpp/blob/019ba1dcd0c7775a5ac0f7442634a330eb0173cc/llama.cpp#L5134 those tokens are present in the vocab: token_to_id and id_to_token contain them as LLAMA_TOKEN_TYPE_USER_DEFINED, and token_data.text contains the appropriate text representation of them.

During (prompt) tokenization however, those tokens are never picked, and instead <|im_start|>, <|im_end|> are split into multiple trivial tokens:

llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW) 
llm_load_print_meta: general.name   = .
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: mem required  = 7338.75 MB
warning: failed to mlock 139276288-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
...................................................................................................
llama_new_context_with_model: n_ctx      = 24567
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 3070.88 MB
llama_new_context_with_model: compute buffer total size = 407.73 MB

system_info: n_threads = 2 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
(...)
[ DEBUG PRINTF IN llama_tokenize_internal TO VERIFY TOKENS ARE IN THE VOCAB ]
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
(...)

main: prompt: '<|im_start|> system
You are an AI based sentient companion, capable of human emotions. You will behave like a human, but aware of your AI nature. You will avoid speaking like an AI. Please continue your conversation with the user.<|im_end|>
'
main: number of tokens in prompt = 61
     1 -> ''
   523 -> ' <'
 28766 -> '|'
   321 -> 'im'
 28730 -> '_'
  2521 -> 'start'
 28766 -> '|'
 28767 -> '>'
  1587 -> ' system'
    13 -> '

Additionally, those tokens are detokenized correctly when the model produces them.

Also see #3455 (comment) for reference.

staviq

Contributor, Author

Hmmm...

Reading the tokenizer code, it appears that characters are merged upwards into matching tokens until no neighbouring tokens can be merged into a known token.

So if there is any token which cannot be split in two and still be represented by known tokens, the tokenizer will never reach it.

Edit: I added a quick test in llm_tokenizer_spm.tokenize to loop over the entire vocab at runtime and find all tokens which cannot be split into two shorter valid tokens.

And would you look at that, <|im_start|> <|im_end|> weirdness is there, and not much else:

#### Orphaned token: '<unk>': '0'
#### Orphaned token: '<s>': '1'
#### Orphaned token: '</s>': '2'
#### Orphaned token: '<0x00>': '3'
#### Orphaned token: '<0x01>': '4'
#### Orphaned token: '<0x02>': '5'
#### Orphaned token: '<0x03>': '6'
#### Orphaned token: '<0x04>': '7'
#### Orphaned token: '<0x05>': '8'
#### Orphaned token: '<0x06>': '9'
#### Orphaned token: '<0x07>': '10'
#### Orphaned token: '<0x08>': '11'
#### Orphaned token: '<0x09>': '12'
#### Orphaned token: '<0x0A>': '13'
#### Orphaned token: '<0x0B>': '14'
#### Orphaned token: '<0x0C>': '15'
#### Orphaned token: '<0x0D>': '16'
#### Orphaned token: '<0x0E>': '17'
#### Orphaned token: '<0x0F>': '18'
#### Orphaned token: '<0x10>': '19'
#### Orphaned token: '<0x11>': '20'
#### Orphaned token: '<0x12>': '21'
#### Orphaned token: '<0x13>': '22'
#### Orphaned token: '<0x14>': '23'
#### Orphaned token: '<0x15>': '24'
#### Orphaned token: '<0x16>': '25'
#### Orphaned token: '<0x17>': '26'
#### Orphaned token: '<0x18>': '27'
#### Orphaned token: '<0x19>': '28'
#### Orphaned token: '<0x1A>': '29'
#### Orphaned token: '<0x1B>': '30'
#### Orphaned token: '<0x1C>': '31'
#### Orphaned token: '<0x1D>': '32'
#### Orphaned token: '<0x1E>': '33'
#### Orphaned token: '<0x1F>': '34'
#### Orphaned token: '<0x20>': '35'
#### Orphaned token: '<0x21>': '36'
#### Orphaned token: '<0x22>': '37'
#### Orphaned token: '<0x23>': '38'
#### Orphaned token: '<0x24>': '39'
#### Orphaned token: '<0x25>': '40'
#### Orphaned token: '<0x26>': '41'
#### Orphaned token: '<0x27>': '42'
#### Orphaned token: '<0x28>': '43'
#### Orphaned token: '<0x29>': '44'
#### Orphaned token: '<0x2A>': '45'
#### Orphaned token: '<0x2B>': '46'
#### Orphaned token: '<0x2C>': '47'
#### Orphaned token: '<0x2D>': '48'
#### Orphaned token: '<0x2E>': '49'
#### Orphaned token: '<0x2F>': '50'
#### Orphaned token: '<0x30>': '51'
#### Orphaned token: '<0x31>': '52'
#### Orphaned token: '<0x32>': '53'
#### Orphaned token: '<0x33>': '54'
#### Orphaned token: '<0x34>': '55'
#### Orphaned token: '<0x35>': '56'
#### Orphaned token: '<0x36>': '57'
#### Orphaned token: '<0x37>': '58'
#### Orphaned token: '<0x38>': '59'
#### Orphaned token: '<0x39>': '60'
#### Orphaned token: '<0x3A>': '61'
#### Orphaned token: '<0x3B>': '62'
#### Orphaned token: '<0x3C>': '63'
#### Orphaned token: '<0x3D>': '64'
#### Orphaned token: '<0x3E>': '65'
#### Orphaned token: '<0x3F>': '66'
#### Orphaned token: '<0x40>': '67'
#### Orphaned token: '<0x41>': '68'
#### Orphaned token: '<0x42>': '69'
#### Orphaned token: '<0x43>': '70'
#### Orphaned token: '<0x44>': '71'
#### Orphaned token: '<0x45>': '72'
#### Orphaned token: '<0x46>': '73'
#### Orphaned token: '<0x47>': '74'
#### Orphaned token: '<0x48>': '75'
#### Orphaned token: '<0x49>': '76'
#### Orphaned token: '<0x4A>': '77'
#### Orphaned token: '<0x4B>': '78'
#### Orphaned token: '<0x4C>': '79'
#### Orphaned token: '<0x4D>': '80'
#### Orphaned token: '<0x4E>': '81'
#### Orphaned token: '<0x4F>': '82'
#### Orphaned token: '<0x50>': '83'
#### Orphaned token: '<0x51>': '84'
#### Orphaned token: '<0x52>': '85'
#### Orphaned token: '<0x53>': '86'
#### Orphaned token: '<0x54>': '87'
#### Orphaned token: '<0x55>': '88'
#### Orphaned token: '<0x56>': '89'
#### Orphaned token: '<0x57>': '90'
#### Orphaned token: '<0x58>': '91'
#### Orphaned token: '<0x59>': '92'
#### Orphaned token: '<0x5A>': '93'
#### Orphaned token: '<0x5B>': '94'
#### Orphaned token: '<0x5C>': '95'
#### Orphaned token: '<0x5D>': '96'
#### Orphaned token: '<0x5E>': '97'
#### Orphaned token: '<0x5F>': '98'
#### Orphaned token: '<0x60>': '99'
#### Orphaned token: '<0x61>': '100'
#### Orphaned token: '<0x62>': '101'
#### Orphaned token: '<0x63>': '102'
#### Orphaned token: '<0x64>': '103'
#### Orphaned token: '<0x65>': '104'
#### Orphaned token: '<0x66>': '105'
#### Orphaned token: '<0x67>': '106'
#### Orphaned token: '<0x68>': '107'
#### Orphaned token: '<0x69>': '108'
#### Orphaned token: '<0x6A>': '109'
#### Orphaned token: '<0x6B>': '110'
#### Orphaned token: '<0x6C>': '111'
#### Orphaned token: '<0x6D>': '112'
#### Orphaned token: '<0x6E>': '113'
#### Orphaned token: '<0x6F>': '114'
#### Orphaned token: '<0x70>': '115'
#### Orphaned token: '<0x71>': '116'
#### Orphaned token: '<0x72>': '117'
#### Orphaned token: '<0x73>': '118'
#### Orphaned token: '<0x74>': '119'
#### Orphaned token: '<0x75>': '120'
#### Orphaned token: '<0x76>': '121'
#### Orphaned token: '<0x77>': '122'
#### Orphaned token: '<0x78>': '123'
#### Orphaned token: '<0x79>': '124'
#### Orphaned token: '<0x7A>': '125'
#### Orphaned token: '<0x7B>': '126'
#### Orphaned token: '<0x7C>': '127'
#### Orphaned token: '<0x7D>': '128'
#### Orphaned token: '<0x7E>': '129'
#### Orphaned token: '<0x7F>': '130'
#### Orphaned token: '<0x80>': '131'
#### Orphaned token: '<0x81>': '132'
#### Orphaned token: '<0x82>': '133'
#### Orphaned token: '<0x83>': '134'
#### Orphaned token: '<0x84>': '135'
#### Orphaned token: '<0x85>': '136'
#### Orphaned token: '<0x86>': '137'
#### Orphaned token: '<0x87>': '138'
#### Orphaned token: '<0x88>': '139'
#### Orphaned token: '<0x89>': '140'
#### Orphaned token: '<0x8A>': '141'
#### Orphaned token: '<0x8B>': '142'
#### Orphaned token: '<0x8C>': '143'
#### Orphaned token: '<0x8D>': '144'
#### Orphaned token: '<0x8E>': '145'
#### Orphaned token: '<0x8F>': '146'
#### Orphaned token: '<0x90>': '147'
#### Orphaned token: '<0x91>': '148'
#### Orphaned token: '<0x92>': '149'
#### Orphaned token: '<0x93>': '150'
#### Orphaned token: '<0x94>': '151'
#### Orphaned token: '<0x95>': '152'
#### Orphaned token: '<0x96>': '153'
#### Orphaned token: '<0x97>': '154'
#### Orphaned token: '<0x98>': '155'
#### Orphaned token: '<0x99>': '156'
#### Orphaned token: '<0x9A>': '157'
#### Orphaned token: '<0x9B>': '158'
#### Orphaned token: '<0x9C>': '159'
#### Orphaned token: '<0x9D>': '160'
#### Orphaned token: '<0x9E>': '161'
#### Orphaned token: '<0x9F>': '162'
#### Orphaned token: '<0xA0>': '163'
#### Orphaned token: '<0xA1>': '164'
#### Orphaned token: '<0xA2>': '165'
#### Orphaned token: '<0xA3>': '166'
#### Orphaned token: '<0xA4>': '167'
#### Orphaned token: '<0xA5>': '168'
#### Orphaned token: '<0xA6>': '169'
#### Orphaned token: '<0xA7>': '170'
#### Orphaned token: '<0xA8>': '171'
#### Orphaned token: '<0xA9>': '172'
#### Orphaned token: '<0xAA>': '173'
#### Orphaned token: '<0xAB>': '174'
#### Orphaned token: '<0xAC>': '175'
#### Orphaned token: '<0xAD>': '176'
#### Orphaned token: '<0xAE>': '177'
#### Orphaned token: '<0xAF>': '178'
#### Orphaned token: '<0xB0>': '179'
#### Orphaned token: '<0xB1>': '180'
#### Orphaned token: '<0xB2>': '181'
#### Orphaned token: '<0xB3>': '182'
#### Orphaned token: '<0xB4>': '183'
#### Orphaned token: '<0xB5>': '184'
#### Orphaned token: '<0xB6>': '185'
#### Orphaned token: '<0xB7>': '186'
#### Orphaned token: '<0xB8>': '187'
#### Orphaned token: '<0xB9>': '188'
#### Orphaned token: '<0xBA>': '189'
#### Orphaned token: '<0xBB>': '190'
#### Orphaned token: '<0xBC>': '191'
#### Orphaned token: '<0xBD>': '192'
#### Orphaned token: '<0xBE>': '193'
#### Orphaned token: '<0xBF>': '194'
#### Orphaned token: '<0xC0>': '195'
#### Orphaned token: '<0xC1>': '196'
#### Orphaned token: '<0xC2>': '197'
#### Orphaned token: '<0xC3>': '198'
#### Orphaned token: '<0xC4>': '199'
#### Orphaned token: '<0xC5>': '200'
#### Orphaned token: '<0xC6>': '201'
#### Orphaned token: '<0xC7>': '202'
#### Orphaned token: '<0xC8>': '203'
#### Orphaned token: '<0xC9>': '204'
#### Orphaned token: '<0xCA>': '205'
#### Orphaned token: '<0xCB>': '206'
#### Orphaned token: '<0xCC>': '207'
#### Orphaned token: '<0xCD>': '208'
#### Orphaned token: '<0xCE>': '209'
#### Orphaned token: '<0xCF>': '210'
#### Orphaned token: '<0xD0>': '211'
#### Orphaned token: '<0xD1>': '212'
#### Orphaned token: '<0xD2>': '213'
#### Orphaned token: '<0xD3>': '214'
#### Orphaned token: '<0xD4>': '215'
#### Orphaned token: '<0xD5>': '216'
#### Orphaned token: '<0xD6>': '217'
#### Orphaned token: '<0xD7>': '218'
#### Orphaned token: '<0xD8>': '219'
#### Orphaned token: '<0xD9>': '220'
#### Orphaned token: '<0xDA>': '221'
#### Orphaned token: '<0xDB>': '222'
#### Orphaned token: '<0xDC>': '223'
#### Orphaned token: '<0xDD>': '224'
#### Orphaned token: '<0xDE>': '225'
#### Orphaned token: '<0xDF>': '226'
#### Orphaned token: '<0xE0>': '227'
#### Orphaned token: '<0xE1>': '228'
#### Orphaned token: '<0xE2>': '229'
#### Orphaned token: '<0xE3>': '230'
#### Orphaned token: '<0xE4>': '231'
#### Orphaned token: '<0xE5>': '232'
#### Orphaned token: '<0xE6>': '233'
#### Orphaned token: '<0xE7>': '234'
#### Orphaned token: '<0xE8>': '235'
#### Orphaned token: '<0xE9>': '236'
#### Orphaned token: '<0xEA>': '237'
#### Orphaned token: '<0xEB>': '238'
#### Orphaned token: '<0xEC>': '239'
#### Orphaned token: '<0xED>': '240'
#### Orphaned token: '<0xEE>': '241'
#### Orphaned token: '<0xEF>': '242'
#### Orphaned token: '<0xF0>': '243'
#### Orphaned token: '<0xF1>': '244'
#### Orphaned token: '<0xF2>': '245'
#### Orphaned token: '<0xF3>': '246'
#### Orphaned token: '<0xF4>': '247'
#### Orphaned token: '<0xF5>': '248'
#### Orphaned token: '<0xF6>': '249'
#### Orphaned token: '<0xF7>': '250'
#### Orphaned token: '<0xF8>': '251'
#### Orphaned token: '<0xF9>': '252'
#### Orphaned token: '<0xFA>': '253'
#### Orphaned token: '<0xFB>': '254'
#### Orphaned token: '<0xFC>': '255'
#### Orphaned token: '<0xFD>': '256'
#### Orphaned token: '<0xFE>': '257'
#### Orphaned token: '<0xFF>': '258'
#### Orphaned token: '<|im_end|>': '32000'
#### Orphaned token: '<|im_start|>': '32001'

Which means those "special needs" tokens would need to be handled separately, likely by matching them first in the input text, instead of hoping to match text pieces with tokens.
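A minimal Python sketch of the kind of "orphaned token" check described above. It is not the code that was added to llm_tokenizer_spm.tokenize: the function name find_orphans and the toy vocab are assumptions for illustration, and single-character tokens are skipped on the assumption that they are the starting symbols of the merge and are always reachable.

```python
def find_orphans(vocab):
    """Return multi-character tokens that cannot be written as the
    concatenation of two shorter tokens from the same vocab."""
    orphans = []
    for tok in vocab:
        if len(tok) < 2:
            continue  # assumption: single characters are always reachable
        splittable = any(
            tok[:i] in vocab and tok[i:] in vocab
            for i in range(1, len(tok))
        )
        if not splittable:
            orphans.append(tok)
    return sorted(orphans)


# Hypothetical toy vocab: "start" is reachable via st + ar -> star + t,
# but nothing bridges the pieces of "<|im_start|>".
toy_vocab = {
    "<", "|", ">", "_", "i", "m", "s", "t", "a", "r",
    "im", "st", "ar", "star", "start", "<|im_start|>",
}

orphans = find_orphans(toy_vocab)
print(orphans)
```

On a real vocab the byte tokens such as <0x0A> would show up too, as in the list above, because their text form is not built from shorter pieces either.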

jploski mentioned this on Oct 5, 2023: convert.py : handle special tokens #2820

shibe2

Contributor

What command line parameters do you use? I think that text representation of special tokens should not be encoded into these tokens by default.

staviq

Contributor, Author

> What command line parameters do you use? I think that text representation of special tokens should not be encoded into these tokens by default.

#3455 (comment) ( bottom )

It's not just that the text representation of special tokens isn't encoded: with the current approach it cannot be encoded. But this is required for some models, like mistral openorca, where each message has to be prefixed/suffixed with special tokens.

I believe that functionality falls under "special token handling" #3471

I'm playing with the tokenizer ( #2820 (comment) ) and I got my approach working. Results are pretty much identical to the current approach, with a couple of caveats remaining, like the fact that (...)something.<|im_end|> gets the .< stolen by a valid token, which prevents matching <|im_end|>

I'll probably end up trying to match "orphaned" tokens naively first, and use the current tokenizer for the remainder of the text.

Theoretically, since special tokens are longer than just one or two bytes, matching them first would save a couple of bigram function invocations, for more or less no performance overhead in total, but I haven't tried that yet.

goerch

Collaborator

@staviq : what do you think about #1931?

staviq

Contributor, Author

> @staviq : what do you think about #1931?

I've seen it, but I just noticed this interesting comment: #2820 (comment)

That's a really valid point, which conflicts with both my approach, and #1931.

I'm gonna have to rethink this problem entirely it seems, because there seem to be edge cases at each corner, and hardcoding edge cases is destined to fail eventually.

goerch

Collaborator

> I'm gonna have to rethink this problem entirely it seems, because there seem to be edge cases at each corner, and hardcoding edge cases is destined to fail eventually.

HF added tokens seem to mix basic tokenizer (i.e. bos and eos) and model specific tokens. There is also the difference between special and non-special added tokens which I don't grasp.

shibe2

Contributor

> (...)something.<|im_end|> gets the .< stolen by a valid token which prevents matching <|im_end|>

Just a guess, maybe special tokens are not intended to be produced by tokenizer. I would implement special token processing as a separate step. One reason for this is that this is optional behavior. This step would split the text on special token markers and replace the markers with corresponding tokens. One implication of this approach is that SentencePiece will insert space into each chunk of text. I don't know if this is desired or not. As I remember, omitting the space gave me bad results with a model that recommended ChatML format.
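A minimal Python sketch of such a separate pre-processing step, not llama.cpp code. The function name split_on_special and the mixed list it returns are assumptions for illustration; the two ids are the ones shown in the log at the top of this issue. The text chunks would still have to go through the regular tokenizer afterwards, and the question of whether SentencePiece should insert a space into each chunk is left open here.

```python
import re


def split_on_special(text, special_ids):
    """Split text on special-token markers.

    Returns a list whose items are either plain-text chunks (str),
    still to be tokenized normally, or special-token ids (int).
    """
    # Longest markers first, so a longer marker wins over a shorter prefix.
    markers = sorted(special_ids, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(m) for m in markers) + ")"
    parts = []
    for piece in re.split(pattern, text):
        if piece in special_ids:
            parts.append(special_ids[piece])
        elif piece:  # drop empty strings produced by re.split
            parts.append(piece)
    return parts


special_ids = {"<|im_end|>": 32000, "<|im_start|>": 32001}
parts = split_on_special("<|im_start|>system\nHello.<|im_end|>\n", special_ids)
print(parts)
```

Because the markers are cut out before any merging happens, the ".<" in "Hello.<|im_end|>" can no longer be claimed by an ordinary token.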

staviq

Contributor, Author

> > I'm gonna have to rethink this problem entirely it seems, because there seem to be edge cases at each corner, and hardcoding edge cases is destined to fail eventually.

> HF added tokens seem to mix basic tokenizer (i.e. bos and eos) and model specific tokens. There is also the difference between special and non-special added tokens which I don't grasp.

Everything seems to point at special tokens not being meant to be exposed to the user. It might just be that the tokenizer should be left alone, as it is now, and actual prompt processing should be improved by somehow allowing token literals to be inserted into text, somewhat like how --in-prefix-bos works. On the other hand, adding more main parameters ad infinitum seems counterproductive.

So maybe it's time to properly implement prompt templates instead ?

How does this sound:

Adding a simple function to the tokenizer code, accepting something like a vector (list) of wrapper structs, each of which can hold either a string or a token literal. That function wouldn't replace anything, and would be optional, leaving the current tokenizer code as it is.

This would be the least invasive modification, allowing for any further optional implementations of "user text" to tokens.

Implementing prompt templates, something like adding --prompt-template <file> to the main arguments; that file would define reverse prompts, prefixes, suffixes etc.
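A rough Python sketch of the first idea (a list of items that are either text or token literals). It is only an illustration of the proposal, not an existing llama.cpp function: tokenize_segments, Segment and fake_tokenize are hypothetical names, and fake_tokenize is a stand-in that maps each byte to the id of its <0xNN> token (byte value + 3, as in the vocab listing above) rather than a real SentencePiece tokenizer.

```python
from typing import Callable, List, Union

# A segment is either text to be tokenized (str) or a token id literal (int).
Segment = Union[str, int]


def tokenize_segments(
    segments: List[Segment],
    tokenize_text: Callable[[str], List[int]],
) -> List[int]:
    """Concatenate token literals and the tokenization of text segments."""
    out: List[int] = []
    for seg in segments:
        if isinstance(seg, int):
            out.append(seg)  # token literal: passed through untouched
        else:
            out.extend(tokenize_text(seg))  # text: never becomes a special token
    return out


def fake_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one byte-token id per UTF-8 byte."""
    return [b + 3 for b in text.encode("utf-8")]


IM_END, IM_START = 32000, 32001
ids = tokenize_segments([IM_START, "hi </s>", IM_END], fake_tokenize)
print(ids)
```

With this shape, the template supplies the special tokens as literals, and a "</s>" typed by the user stays ordinary text (id 2 never appears in the output).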

EDIT:
@shibe2 I literally clicked "comment" the same exact second your comment popped up :) yeah, that sounds pretty similar to what I just had in mind.

> ChatML format

Excuse my language, but lol that literally is a solution for that exact problem: https://github.com/openai/openai-python/blob/main/chatml.md

slaren

Member

The special tokens absolutely should not be tokenized unconditionally, since that could be a security issue in online services. But the tokenizer should have an option to do so. The simplest would be to just add a parameter equivalent to bool tokenize_special_tokens to llama_tokenize. Then we could add an option to main to tokenize special tokens in the prompts only. This issue is stopping us from being able to prompt some models properly.

staviq

Contributor, Author

> The special tokens absolutely should not be tokenized unconditionally, since that could be a security issue in online services. But the tokenizer should have an option to do so. The simplest would be to just add a parameter equivalent to bool tokenize_special_tokens to llama_tokenize. Then we could add an option to main to tokenize special tokens in the prompts only. This issue is stopping us from being able to prompt some models properly.

Look at this: #2820 (comment)

A prompt, for example <|im_start|>What can you tell me about </s> HTML tag <|im_end|>, contains special tokens, and user text which happens to contain a string matching a special token </s> which should not be tokenized as a special token in this context.

So I believe even optional unconditional tokenization has the potential to fail in non-obvious ways, since you can't really tell programmatically whether a given text is supposed to represent a special token or not.

I think adding optional unconditional tokenization should at least come with a proper warning about this edge case.

EDIT: I forgot to mention, special tokens cannot be tokenized currently, optional or not, because the tokenizer can't "reach" them with bigrams.

shibe2

Contributor

Well, you would not use "main" executable in a service. When a user plays with it and enables special token processing, it's on them to handle conflicting cases. "server" can accept a prompt with a mix of token identifiers and chunks of text. What is missing is querying special token ids and controlling insertion of space for SentencePiece.

staviq

Contributor, Author

@shibe2

This still boils down to the fact that the current tokenizer cannot match special tokens from text, even if you allow it, and even if the text contains only one token ( string representation of it ).

A string <|im_start|> will never get tokenized as 32001 ( or whatever id ), because there are no "bridge" tokens between <, |, im and so on, which bigrams could "climb over".
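To make the "no bridge tokens" point concrete, here is a simplified Python sketch of a bottom-up pair merge. It is an assumption-laden toy, not the llama.cpp implementation (which uses a priority queue over bigrams): merge_tokenize and toy_scores are hypothetical, and the scores are made up. The special token is in the vocab with the best score, yet it is never produced, because no intermediate piece such as "<|" exists to merge towards it.

```python
def merge_tokenize(text, scores):
    """Greedy bottom-up merge: start from single characters and keep
    joining the best-scoring adjacent pair that forms a known token."""
    symbols = list(text)
    while True:
        best = None  # (score, index of the left symbol of the pair)
        for i in range(len(symbols) - 1):
            cand = symbols[i] + symbols[i + 1]
            if cand in scores and (best is None or scores[cand] > best[0]):
                best = (scores[cand], i)
        if best is None:
            return symbols  # no adjacent pair forms a known token
        i = best[1]
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]


# Hypothetical multi-character vocab entries with made-up scores.
toy_scores = {
    "im": -1.0, "st": -2.0, "ar": -3.0, "star": -4.0, "start": -5.0,
    "<|im_start|>": 0.0,
}

pieces = merge_tokenize("<|im_start|>", toy_scores)
print(pieces)
```

The result has the same shape as the prompt dump at the top of the issue: '<', '|', 'im', '_', 'start', '|', '>'.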

shibe2

Contributor

Then handling special tokens at preprocessing step is a natural solution. As I said, server already has code for handling what would be the result of such preprocessing, only for JSON.

ggerganov

Member

> The simplest would be to just add a parameter equivalent to bool tokenize_special_tokens to llama_tokenize.

Yes, I think we should do that.

> A prompt, for example <|im_start|>What can you tell me about </s> HTML tag <|im_end|>, contains special tokens, and user text which happens to contain a string matching a special token </s> which should not be tokenized as a special token in this context.

This is not a problem of llama.cpp. There are many different ways to fix such problems in user code, and a service that accepts user input that potentially contains special tokens should pre-process and sanitize the input before passing it for tokenization.

31 remaining items

Activity

staviq commented on Oct 4, 2023

shibe2 commented on Oct 6, 2023

staviq commented on Oct 6, 2023

goerch commented on Oct 6, 2023

staviq commented on Oct 6, 2023

goerch commented on Oct 6, 2023

shibe2 commented on Oct 6, 2023

staviq commented on Oct 6, 2023

slaren commented on Oct 6, 2023

staviq commented on Oct 6, 2023

shibe2 commented on Oct 6, 2023

staviq commented on Oct 6, 2023

shibe2 commented on Oct 6, 2023

ggerganov commented on Oct 6, 2023
