
pred_loss decrease fast while avg_acc stay at 50% #32


Open
jiqiujia opened this issue Oct 23, 2018 · 53 comments
Labels
help wanted · invalid · question


@jiqiujia

I tried to run the code on a small dataset and found that pred_loss decreases fast while avg_acc stays at 50%. This is strange to me, since a decrease in pred_loss should indicate an increase in accuracy.
[screenshot: training log showing pred_loss decreasing while avg_acc stays near 50%]

@wenhaozheng-nju

I also met the same problem on a small dataset.

@NiHaoUCAS

me too

@codertimo
Owner

Hmm, interesting... Is this the result from the 0.0.1a4 version?
And how did you guys print out that result?

@NiHaoUCAS

Hmm, interesting... Is this the result from the 0.0.1a4 version?
And how did you guys print out that result?

Version 0.0.1a3. The result is printed out by the bert command, with no modifications.

codertimo added the help wanted, invalid, and question labels on Oct 23, 2018
@jiqiujia
Author

Hmm, interesting... Is this the result from the 0.0.1a4 version?
And how did you guys print out that result?

I tried using 0.0.1a4 and the result is the same.

@codertimo
Owner

Hmmm... does anyone have any clues?

@yangze01

I tried using different data: continuous sentence pairs from the same document, continuous sentences concatenated into longer sentences, and query–document pairs; the result is the same. I also found that there is a big gap between next_loss and mask_loss, although they use the same loss function.
[screenshot: training log showing the gap between next_loss and mask_loss]

@cairoHy

cairoHy commented Oct 24, 2018

Probably the criterion loss function is the problem.

import torch
import torch.nn as nn

# shape [10, 2]: not very accurate output
# (the model almost always predicts class 1)
out = torch.tensor([[ -8.4014,  -0.0002],
        [-10.3151,  -0.0000],
        [ -8.8440,  -0.0001],
        [ -7.5148,  -0.0005],
        [-11.0145,  -0.0000],
        [-10.9770,  -0.0000],
        [-13.3770,  -0.0000],
        [ -9.5733,  -0.0001],
        [ -9.5957,  -0.0001],
        [ -9.0712,  -0.0001]])
# shape [10]: next sentence labels
label = torch.tensor([1,1,0,1,0,0,1,0,0,1])
original_criterion = nn.NLLLoss(ignore_index=0)  # drops every label-0 example!
criterion = nn.NLLLoss()
original_loss = original_criterion(out, label)
loss = criterion(out, label)

With the above code snippet, original_loss is 0.0002 while loss is 5.0005: ignore_index=0 silently drops every isNotNext (label 0) example from the loss, so the model can drive pred_loss down by always predicting isNext while accuracy stays at chance.

I changed the following code in trainer/pretrain.py:

self.criterion = nn.NLLLoss(ignore_index=0)

to:

self.criterion = nn.NLLLoss()

And since the magnitude of next_loss is smaller than that of mask_loss, I also over-weighted the next_loss, and got 58% next-sentence accuracy after training on my corpus for one epoch.
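
A minimal sketch of that re-weighting (the tensor values and the weight of 5 are assumptions for illustration; in the trainer the two terms come from the masked-LM and next-sentence criteria):

import torch

# Hypothetical stand-ins for the two loss terms produced by the trainer.
mask_loss = torch.tensor(7.0, requires_grad=True)
next_loss = torch.tensor(0.7, requires_grad=True)

next_weight = 5.0  # illustrative value, not a tuned constant
loss = mask_loss + next_weight * next_loss  # up-weight the smaller NSP term
loss.backward()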

@jiqiujia
Author

Probably the criterion loss function is the problem. [...]

That's right, I just figured it out. Also note that for the masked LM we still need ignore_index=0, since we only want to predict the masked words.

codertimo added a commit that referenced this issue Oct 25, 2018
@codertimo
Owner

@cairoHy Wow, thank you for your smart analysis.

I just fixed this issue on the 0.0.1a5 version branch. The changes are below:

self.masked_criterion = nn.NLLLoss(ignore_index=0)
self.next_criterion = nn.NLLLoss()

# 2-1. NLL(negative log likelihood) loss of is_next classification result
next_loss = self.next_criterion(next_sent_output, data["is_next"])
# 2-2. NLLLoss of predicting masked token word
mask_loss = self.masked_criterion(mask_lm_output.transpose(1, 2), data["bert_label"])
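
For a quick sanity check, here is a self-contained sketch of the two criteria on dummy tensors (the shapes and the final summation are assumptions for illustration, not an exact excerpt from the trainer):

import torch
import torch.nn as nn

masked_criterion = nn.NLLLoss(ignore_index=0)  # skips label 0 (unmasked/pad positions)
next_criterion = nn.NLLLoss()                  # counts both is_next classes

batch, seq_len, vocab = 4, 8, 100              # dummy sizes
mask_lm_output = torch.log_softmax(torch.randn(batch, seq_len, vocab), dim=-1)
bert_label = torch.randint(0, vocab, (batch, seq_len))  # 0 marks ignored positions
next_sent_output = torch.log_softmax(torch.randn(batch, 2), dim=-1)
is_next = torch.randint(0, 2, (batch,))

next_loss = next_criterion(next_sent_output, is_next)
mask_loss = masked_criterion(mask_lm_output.transpose(1, 2), bert_label)
loss = next_loss + mask_loss  # the trainer sums the two terms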

@codertimo
Owner

codertimo commented Oct 25, 2018

Thanks to everyone who joined this investigation :)
It was totally my fault, and I'm sorry for the inconvenience while the bug was being fixed.

Additionally, is there anyone who can test the new code with their own corpus?
Any feedback would be welcome, and you can reinstall the new version using the commands below.

git clone https://github.com/codertimo/BERT-pytorch.git
cd BERT-pytorch
git checkout 0.0.1a5
pip install -U .

Special thanks to @jiqiujia @cairoHy @NiHaoUCAS @wenhaozheng-nju

@jiqiujia
Author

@cairoHy After the modification, the model still can't converge. Any suggestions?

@codertimo
Owner

@jiqiujia Can you give me the details, like figures or logs?

@jiqiujia
Author

@codertimo The loss just doesn't converge.
[screenshot: loss curve that fails to converge]

@codertimo
Owner

bert-small-25-logs.txt: this is the result on my 1M-line corpus after 1 epoch; anyone is welcome to review it.

@yangze01

@codertimo Could you please show your parameter settings?

@codertimo
Owner

@yangze01 Just the default params, with batch size 128.

@yangze01

yangze01 commented Oct 26, 2018

@codertimo I think this code has an error: if len(t1) is longer than seq_len, bert_input will contain only t1, and segment_label will likewise contain only the segment labels of t1.
[screenshot: the truncation code in the dataset class]
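
For reference, a sketch of joint truncation in the spirit of the original BERT's truncate_seq_pair (an illustrative helper, not what this repo currently does):

def truncate_pair(t1, t2, max_len):
    # Trim a token-list pair so the combined length fits the budget,
    # always shortening the longer of the two sequences.
    while len(t1) + len(t2) > max_len:
        longer = t1 if len(t1) > len(t2) else t2
        longer.pop()  # drop a token from the end
    return t1, t2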

@codertimo
Owner

I know, but in my corpus each line is usually shorter than 10 tokens per sentence, and seq_len should be set properly by the user. I don't think it's a bug, and it doesn't belong in this thread.

@wenhaozheng-nju

@codertimo I think the next-sentence sampling has a serious bug. Suppose 'B' is the next sentence of 'A'; you may never sample a negative instance for 'A'.

@codertimo
Owner

@wenhaozheng-nju I did negative sampling:

def random_sent(self, index):
    t1, t2 = self.get_corpus_line(index)
    # output_text, label (isNotNext: 0, isNext: 1)
    if random.random() > 0.5:
        return t1, t2, 1
    else:
        return t1, self.get_random_line(), 0

def get_random_line(self):
    if self.on_memory:
        return self.lines[random.randrange(len(self.lines))][1]
    line = self.file.__next__()
    if line is None:
        self.file.close()
        self.file = open(self.corpus_path, "r", encoding=self.encoding)
        # skip a random number of lines so the sample is not always the same
        for _ in range(random.randrange(self.corpus_lines if self.corpus_lines < 1000 else 1000)):
            self.random_file.__next__()
        line = self.random_file.__next__()
    return line[:-1].split("\t")[1]

@wenhaozheng-nju

@codertimo Suppose the dataset is:
A \t B; B \t C; C \t D; D \t E;
After your preprocessing:
A \t B; B \t Random; C \t D; D \t Random;
The negative instance "A \t Random" may never be sampled
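
A sketch of the per-sentence scheme this seems to propose, where every sentence yields both a positive and a negative pair (the corpus format and helper name are assumptions for illustration):

import random

def make_pairs(corpus):
    # corpus: list of (sentence, next_sentence) tuples (hypothetical format)
    pairs = []
    for t1, t2 in corpus:
        pairs.append((t1, t2, 1))          # positive: the true next sentence
        t_rand = random.choice(corpus)[1]  # negative: a random sentence
        pairs.append((t1, t_rand, 0))
    return pairs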

@codertimo
Owner

@wenhaozheng-nju Hmmm, but do you think it's the main problem behind this issue? I'd guess it's a model problem.

@wenhaozheng-nju

@codertimo Yes, the model should sample a positive and a negative instance for each sentence, as in sentence-pair classification problems. I think the two tasks are the same.

@codertimo
Owner

@wenhaozheng-nju Then do you think that if I change the negative sampling code as you suggest, this issue could be resolved?

@yangze01

@codertimo I think everyone here wants to solve the problem; calm down, and let's focus on the issue. @wenhaozheng-nju If you think that's the problem, you can try modifying the code and running it. (But I think it's not the main problem; random negative sampling is a commonly used strategy.)

@codertimo
Owner

@jiqiujia I trained on my dataset for 10 hours last night, with dropout rate 0.0 (which is the same as no dropout) and with dropout rate 0.1. Unfortunately, neither test loss converged.
[screenshot: test loss curves for dropout 0.0 and 0.1]

@yangze01

@jiqiujia Could you share more details? I trained with 1,000,000 samples, seq_len: 64, vocab_size: 100,000, dropout = 0, but the result is the same as before.

@jiqiujia
Author

My parameter settings are as follows; I set the next_sentence loss weight to 5 (it should be annealed, or just set to 1, I think). I only have about 10,000 sentence pairs, and the vocab size is about 4,000.
[screenshot: parameter settings]
By the way, I also tried a test based on OpenNMT-py's Transformer implementation, but it failed to converge as well. I noticed some differences between the implementations; the Transformer seems to be tricky.

@jiqiujia
Author

jiqiujia commented Oct 27, 2018

I've tried various parameters, and it seems that on my dataset these parameters don't have much impact; only dropout is critical. But my dataset is rather small: I chose a small dataset just for debugging, and I will try some larger datasets. Hope this is helpful, and you're welcome to share your experiments.

@jiqiujia
Author

jiqiujia commented Oct 27, 2018

And this is roughly the whole training log. The accuracy finally seems to get stuck at 81%.

@Kosuke-Szk

Kosuke-Szk commented Oct 27, 2018

It works well in my code.
The accuracy rate got over 90.0%.

The base of the code is version 0.0.1a3.
I've changed 3 parts of this version of the code.

First, turn dropout off in every layer:
dropout = 0.0

Second, fix the NLLLoss setting, changing
self.criterion = nn.NLLLoss(ignore_index=0)
to
self.criterion = nn.NLLLoss()

Third, fix how the prob variable is handled.

prob = random.random()
if prob < 0.15:
    prob /= 0.15

    # 80% randomly change token to mask token
    if prob < 0.8:
        tokens[i] = self.vocab.mask_index

    # 10% randomly change token to random token
    elif prob < 0.9:
        tokens[i] = random.randrange(len(self.vocab))
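
    # (the remaining ~10%: the selected token is left unchanged, per the BERT masking scheme)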

After 999 epochs, the result is as below:
[screenshot: training results after 999 epochs]

The parameter settings are here:

hidden=256
layers=8
attn_heads=8
seq_len=32
batch_size=256
epochs=1000
num_workers=5
with_cuda=True
log_freq=50
corpus_lines=None
lr=1e-4
adam_weight_decay=0.01
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0

The dataset is like this:

Language: Japanese
Vocab size: 4670
Number of sentences: 1000

Of course, the changes I described above have already been applied in the latest version.
But if you haven't changed some parts of the code, it may not work well.
Please check it.

@codertimo
Owner

@Kosuke-Szk Thank you for sharing your result with us.
After I saw @Kosuke-Szk's result, I thought, "Isn't our model pretty small to train?"
As you guys know, we reduced the model size to make it trainable on our GPUs, and the training result was bad. However, similar code (almost the same as 0.0.1a4) works with a smaller vocab size and dataset. So, if we make our model bigger, is it going to work? I think it's a kind of underfitting, not just a problem with the model. Does anyone have an idea about this issue?

@wangwei7175878

Hi there,
I trained the model on a big dataset (wiki 2500M + BookCorpus 800M, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%.
[screenshot: training log at ~200,000 steps]
I set weight decay = 0; I think using one of (dropout, weight decay) is enough.

@codertimo
Owner

codertimo commented Oct 30, 2018

@wangwei7175878 WOW, this is brilliant; it's a really huge step for us. Thank you for your effort and computing resources. Is there any result that used the default weight_decay? And can you share the full log as a file?

Original corpus

How did you get the original corpus? I tried very hard to get it, but I failed... I even sent an email to the authors asking for the original corpus, but failed. If possible, can you share the original corpus so that I can test the real performance?

@briandw

briandw commented Oct 30, 2018

Hi there,
I trained the model on a big dataset (wiki 2500M + BookCorpus 800M, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%.

@wangwei7175878 Can you share your pre-trained model? I'm really looking forward to trying this out, but I don't have that kind of processing power.

Thank you for your efforts.

@wangwei7175878

wangwei7175878 commented Oct 30, 2018

@codertimo The model can't converge using weight_decay = 0.01. My dataset is not exactly the original corpus, but I think it is almost the same. The wiki data can easily be downloaded from https://dumps.wikimedia.org/enwiki/ and you need a web spider to get BookCorpus from https://www.smashwords.com/

@wangwei7175878

@briandw My pre-trained model failed on downstream tasks (the fine-tuned model can't converge). I will share the pre-trained model once it works.

@wangwei7175878

@codertimo Here is the whole log. It took me almost one week to train about 250,000 steps. The accuracy seems to be stuck at 91%, whereas the original paper reports 98%.
log_run2_hhh_all_data_next_weight_1_no_decay.txt

@codertimo
Owner

@wangwei7175878 Can you share your crawling and preprocessing code in the above issue? Or, if possible, can you share the full corpus via a shared drive (Dropbox, Google Drive, etc.)? This would be really helpful to us.

@codertimo
Owner

@wangwei7175878 Very interesting; the authors said 0.01 weight decay is the default parameter they used. What are your parameter settings? Are they the same as our code's defaults except for weight_decay?

@wangwei7175878

Hi there,
I believe I figured out why the model can't converge with weight_decay = 0.01. Following OpenAI's code here:
I think BERT used AdamW instead of Adam.
After rewriting this Adam code in PyTorch, my model can now converge with the default settings.
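
For context, a minimal sketch of the decoupled weight decay idea (an AdamW-style update written for illustration; it is not the code from the eventual pull request, and recent PyTorch versions ship this as torch.optim.AdamW):

import torch

def adamw_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    # One AdamW-style update on a single tensor (illustrative helper).
    beta1, beta2 = betas
    # Update biased first- and second-moment estimates.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias-corrected Adam step computed from the gradient only.
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    param.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)
    # Decay the weights directly, decoupled from the gradient-based step;
    # plain Adam with L2 regularization instead folds the decay into the gradient.
    param.mul_(1 - lr * weight_decay)
    return param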

@codertimo
Owner

@wangwei7175878 Sounds great!
Can you make a pull request with your AdamW implementation?
I'll test it on my corpus too 👍

@waynedane

I used my corpus; after three epochs, the accuracy rate is 73.54%. I set weight_decay = 0; the other parameters are the defaults. Training continues.

@shionhonda
Copy link

Just for your reference:
I also confirmed the accuracy increase following @Kosuke-Szk's suggestion.
[plots: loss and accuracy curves]

Though the model was resized to a really small one due to a memory limitation (< 12 GB), it still worked.
The hyperparameters were:

hidden=240 #768
layers=3 #12
attn_heads=3 #12
seq_len=30 # 60
batch_size=8 #32
epochs=10
num_workers=4 #5
with_cuda=True
log_freq=20
corpus_lines=None
lr=1e-3
adam_weight_decay=0.00
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0
min_freq=20 #7

I used 13 GB of English Wikipedia corpus with a vocabulary size of 775k.
But I stopped the job at just 2% progress through the first epoch, because it said it would take thousands of hours.

@zheolong

zheolong commented Jan 16, 2019

Hi there,
I trained the model on a big dataset (wiki 2500M + BookCorpus 800M, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%.
[screenshot: training log]
I set weight decay = 0; I think using one of (dropout, weight decay) is enough.

I need your machine, system, and GPU configuration, thanks.

And I've also built the wiki + BookCorpus dataset; I will publish the docs to help with reconstruction.

@zheolong

@shionhonda How do you print the accuracy every few global steps and finally create that curve?

@shionhonda

@zheolong
The loss and accuracy are exactly what data_iter prints to the console in pretrain.py.
Insert the following code there and plot the output (a plotting sketch follows below):

with open(FILENAME, 'a') as f:
    f.write('%d,%f,%f\n' %(i, avg_loss/(i+1), total_correct/total_element*100))
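
A possible way to plot the logged values afterwards (a sketch; the filename is hypothetical, and the column order follows the write above):

import matplotlib.pyplot as plt
import pandas as pd

FILENAME = "pretrain_log.csv"  # hypothetical path: whatever you logged to above

df = pd.read_csv(FILENAME, names=["step", "avg_loss", "acc"])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(df["step"], df["avg_loss"])
ax1.set_title("avg_loss")
ax2.plot(df["step"], df["acc"])
ax2.set_title("accuracy (%)")
plt.show()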

YongWookHa added a commit to YongWookHa/BERT-pytorch that referenced this issue Jun 3, 2019
@scuhz

scuhz commented Jul 23, 2020

Oh my god! I have no idea about this. I still get avg_acc = 50, even after applying the methods in this issue.
