
pred_loss decrease fast while avg_acc stay at 50% #32


Open
jiqiujia opened this issue Oct 23, 2018 · 53 comments
Labels
help wanted · invalid · question


@jiqiujia

I tried to run the code on a small dataset and found that pred_loss decreases fast while avg_acc stays at 50%. This is strange to me, since a decrease in pred_loss should indicate an increase in accuracy.
[screenshot: training log showing pred_loss decreasing while avg_acc stays near 50%]

@wenhaozheng-nju

I also met the same problem on a small dataset.

@NiHaoUCAS

me too

@codertimo
Owner

Hmm, interesting... Is this the result from the 0.0.1a4 version?
And how did you guys print out that result?

@NiHaoUCAS

Hmm, interesting... Is this the result from the 0.0.1a4 version?
And how did you guys print out that result?

Version 0.0.1a3. The result is printed out by the bert command, with no modifications.

codertimo added the help wanted, invalid, and question labels on Oct 23, 2018
@jiqiujia
Author

Hmm, interesting... Is this the result from the 0.0.1a4 version?
And how did you guys print out that result?

I tried using 0.0.1a4 and the result is the same.

@codertimo
Owner

Hmmm... does anyone have any clues?

@yangze01

I tried using different data: continuous sentence pairs from the same document, continuous sentences concatenated into longer sentences, and query–document pairs; the result is the same. I also found that there is a big gap between next_loss and mask_loss, although they use the same loss function.
[screenshot: training log showing the gap between next_loss and mask_loss]

@cairoHy

cairoHy commented Oct 24, 2018

Probably the criterion loss function is the problem.

import torch
import torch.nn as nn

# shape [10, 2]: not very accurate output
# (the model almost always predicts class 1)
out = torch.tensor([[ -8.4014,  -0.0002],
        [-10.3151,  -0.0000],
        [ -8.8440,  -0.0001],
        [ -7.5148,  -0.0005],
        [-11.0145,  -0.0000],
        [-10.9770,  -0.0000],
        [-13.3770,  -0.0000],
        [ -9.5733,  -0.0001],
        [ -9.5957,  -0.0001],
        [ -9.0712,  -0.0001]])
# shape [10]: next sentence labels
label = torch.tensor([1,1,0,1,0,0,1,0,0,1])
original_criterion = nn.NLLLoss(ignore_index=0)  # drops every label-0 example!
criterion = nn.NLLLoss()
original_loss = original_criterion(out, label)
loss = criterion(out, label)

With the above code snippet, original_loss is 0.0002 while loss is 5.0005: ignore_index=0 silently drops every isNotNext (label 0) example from the loss, so the model can drive pred_loss down by always predicting isNext while accuracy stays at chance.

I changed the following code in trainer/pretrain.py:

self.criterion = nn.NLLLoss(ignore_index=0)

to:

self.criterion = nn.NLLLoss()

And since the magnitude of next_loss is smaller than that of mask_loss, I also over-weighted the next_loss, and got 58% next-sentence accuracy after training on my corpus for one epoch.
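
A minimal sketch of that re-weighting (the tensor values and the weight of 5 are assumptions for illustration; in the trainer the two terms come from the masked-LM and next-sentence criteria):

import torch

# Hypothetical stand-ins for the two loss terms produced by the trainer.
mask_loss = torch.tensor(7.0, requires_grad=True)
next_loss = torch.tensor(0.7, requires_grad=True)

next_weight = 5.0  # illustrative value, not a tuned constant
loss = mask_loss + next_weight * next_loss  # up-weight the smaller NSP term
loss.backward()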

@jiqiujia
Author

Probably the criterion loss function is the problem. [...]

That's right, I just figured it out. Also note that for the masked LM we still need ignore_index=0, since we only want to predict the masked words.

codertimo added a commit that referenced this issue Oct 25, 2018
@codertimo
Owner

@cairoHy Wow, thank you for your smart analysis.

I just fixed this issue on the 0.0.1a5 version branch. The changes are below:

self.masked_criterion = nn.NLLLoss(ignore_index=0)
self.next_criterion = nn.NLLLoss()

# 2-1. NLL(negative log likelihood) loss of is_next classification result
next_loss = self.next_criterion(next_sent_output, data["is_next"])
# 2-2. NLLLoss of predicting masked token word
mask_loss = self.masked_criterion(mask_lm_output.transpose(1, 2), data["bert_label"])
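
For a quick sanity check, here is a self-contained sketch of the two criteria on dummy tensors (the shapes and the final summation are assumptions for illustration, not an exact excerpt from the trainer):

import torch
import torch.nn as nn

masked_criterion = nn.NLLLoss(ignore_index=0)  # skips label 0 (unmasked/pad positions)
next_criterion = nn.NLLLoss()                  # counts both is_next classes

batch, seq_len, vocab = 4, 8, 100              # dummy sizes
mask_lm_output = torch.log_softmax(torch.randn(batch, seq_len, vocab), dim=-1)
bert_label = torch.randint(0, vocab, (batch, seq_len))  # 0 marks ignored positions
next_sent_output = torch.log_softmax(torch.randn(batch, 2), dim=-1)
is_next = torch.randint(0, 2, (batch,))

next_loss = next_criterion(next_sent_output, is_next)
mask_loss = masked_criterion(mask_lm_output.transpose(1, 2), bert_label)
loss = next_loss + mask_loss  # the trainer sums the two terms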

@codertimo
Owner

codertimo commented Oct 25, 2018

Thanks to everyone who joined this investigation :)
It was totally my fault, and I'm sorry for the inconvenience while the bug was being fixed.

Additionally, is there anyone who can test the new code with their own corpus?
Any feedback would be welcome, and you can reinstall the new version using the commands below.

git clone https://github.com/codertimo/BERT-pytorch.git
cd BERT-pytorch
git checkout 0.0.1a5
pip install -U .

Special thanks to @jiqiujia @cairoHy @NiHaoUCAS @wenhaozheng-nju

@jiqiujia
Author

@cairoHy After the modification, the model still can't converge. Any suggestions?

@codertimo
Owner

@jiqiujia Can you give me the details, like figures or logs?

@jiqiujia
Author

@codertimo The loss just doesn't converge.
[screenshot: loss curve that fails to converge]

@codertimo
Owner

bert-small-25-logs.txt: this is the result on my 1M-line corpus after 1 epoch; anyone is welcome to review it.

@yangze01

@codertimo Could you please show your parameter settings?

@codertimo
Owner

@yangze01 Just the default params, with batch size 128.

@yangze01

yangze01 commented Oct 26, 2018

@codertimo I think this code has an error: if len(t1) is longer than seq_len, bert_input will contain only t1, and segment_label will likewise contain only the segment labels of t1.
[screenshot: the truncation code in the dataset class]
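
For reference, a sketch of joint truncation in the spirit of the original BERT's truncate_seq_pair (an illustrative helper, not what this repo currently does):

def truncate_pair(t1, t2, max_len):
    # Trim a token-list pair so the combined length fits the budget,
    # always shortening the longer of the two sequences.
    while len(t1) + len(t2) > max_len:
        longer = t1 if len(t1) > len(t2) else t2
        longer.pop()  # drop a token from the end
    return t1, t2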

@codertimo
Owner

I know, but in my corpus each line is usually shorter than 10 tokens per sentence, and seq_len should be set properly by the user. I don't think it's a bug, and it doesn't belong in this thread.

@wenhaozheng-nju

@codertimo I think the next-sentence sampling has a serious bug. Suppose 'B' is the next sentence of 'A'; you may never sample a negative instance for 'A'.

@codertimo
Owner

@wenhaozheng-nju I did negative sampling:

def random_sent(self, index):
    t1, t2 = self.get_corpus_line(index)
    # output_text, label (isNotNext: 0, isNext: 1)
    if random.random() > 0.5:
        return t1, t2, 1
    else:
        return t1, self.get_random_line(), 0

def get_random_line(self):
    if self.on_memory:
        return self.lines[random.randrange(len(self.lines))][1]
    line = self.file.__next__()
    if line is None:
        self.file.close()
        self.file = open(self.corpus_path, "r", encoding=self.encoding)
        # skip a random number of lines so the sample is not always the same
        for _ in range(random.randrange(self.corpus_lines if self.corpus_lines < 1000 else 1000)):
            self.random_file.__next__()
        line = self.random_file.__next__()
    return line[:-1].split("\t")[1]

@wenhaozheng-nju

@codertimo Suppose the dataset is:
A \t B; B \t C; C \t D; D \t E;
After your preprocessing:
A \t B; B \t Random; C \t D; D \t Random;
The negative instance "A \t Random" may never be sampled
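
A sketch of the per-sentence scheme this seems to propose, where every sentence yields both a positive and a negative pair (the corpus format and helper name are assumptions for illustration):

import random

def make_pairs(corpus):
    # corpus: list of (sentence, next_sentence) tuples (hypothetical format)
    pairs = []
    for t1, t2 in corpus:
        pairs.append((t1, t2, 1))          # positive: the true next sentence
        t_rand = random.choice(corpus)[1]  # negative: a random sentence
        pairs.append((t1, t_rand, 0))
    return pairs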

@codertimo
Owner

@wenhaozheng-nju Hmmm, but do you think it's the main problem behind this issue? I'd guess it's a model problem.

@wenhaozheng-nju

@codertimo Yes, the model should sample a positive and a negative instance for each sentence, as in sentence-pair classification problems. I think the two tasks are the same.

@codertimo
Owner

@wenhaozheng-nju Then do you think that if I change the negative sampling code as you suggest, this issue could be resolved?

@yangze01

@codertimo I think everyone here wants to solve the problem; calm down, and let's focus on the issue. @wenhaozheng-nju If you think that's the problem, you can try modifying the code and running it. (But I think it's not the main problem; random negative sampling is a commonly used strategy.)

@codertimo
Owner

@jiqiujia I trained on my dataset for 10 hours last night, with dropout rate 0.0 (which is the same as no dropout) and with dropout rate 0.1. Unfortunately, neither test loss converged.
[screenshot: test loss curves for dropout 0.0 and 0.1]

@yangze01

@jiqiujia Could you share more details? I trained with 1,000,000 samples, seq_len: 64, vocab_size: 100,000, dropout = 0, but the result is the same as before.

@jiqiujia
Author

My parameter settings are as follows; I set the next_sentence loss weight to 5 (it should be annealed, or just set to 1, I think). I only have about 10,000 sentence pairs, and the vocab size is about 4,000.
[screenshot: parameter settings]
By the way, I also tried a test based on OpenNMT-py's Transformer implementation, but it failed to converge as well. I noticed some differences between the implementations; the Transformer seems to be tricky.

@jiqiujia
Author

jiqiujia commented Oct 27, 2018

I've tried various parameters, and it seems that on my dataset these parameters don't have much impact; only dropout is critical. But my dataset is rather small: I chose a small dataset just for debugging, and I will try some larger datasets. Hope this is helpful, and you're welcome to share your experiments.

@jiqiujia
Author

jiqiujia commented Oct 27, 2018

And this is roughly the whole training log. The accuracy finally seems to get stuck at 81%.

@Kosuke-Szk

Kosuke-Szk commented Oct 27, 2018

It works well in my code.
The accuracy rate got over 90.0%.

The base of the code is version 0.0.1a3.
I've changed 3 parts of this version of the code.

First, turn dropout off in every layer:
dropout = 0.0

Second, fix the NLLLoss setting, changing
self.criterion = nn.NLLLoss(ignore_index=0)
to
self.criterion = nn.NLLLoss()

Third, fix how the prob variable is handled.

prob = random.random()
if prob < 0.15:
    prob /= 0.15

    # 80% randomly change token to mask token
    if prob < 0.8:
        tokens[i] = self.vocab.mask_index

    # 10% randomly change token to random token
    elif prob < 0.9:
        tokens[i] = random.randrange(len(self.vocab))
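
    # (the remaining ~10%: the selected token is left unchanged, per the BERT masking scheme)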

After 999 epochs, the result is as below:
[screenshot: training results after 999 epochs]

The parameter settings are here:

hidden=256
layers=8
attn_heads=8
seq_len=32
batch_size=256
epochs=1000
num_workers=5
with_cuda=True
log_freq=50
corpus_lines=None
lr=1e-4
adam_weight_decay=0.01
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0

The dataset is like this:

Language: Japanese
Vocab size: 4670
Number of sentences: 1000

Of course, the changes I described above have already been applied in the latest version.
But if you haven't changed some parts of the code, it may not work well.
Please check it.

@codertimo
Owner

@Kosuke-Szk Thank you for sharing your result with us.
After I saw @Kosuke-Szk's result, I thought, "Isn't our model pretty small to train?"
As you guys know, we reduced the model size to make it trainable on our GPUs, and the training result was bad. However, similar code (almost the same as 0.0.1a4) works with a smaller vocab size and dataset. So, if we make our model bigger, is it going to work? I think it's a kind of underfitting, not just a problem with the model. Does anyone have an idea about this issue?

@wangwei7175878

Hi there,
I trained the model on a big dataset (wiki 2500M + BookCorpus 800M, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%.
[screenshot: training log at ~200,000 steps]
I set weight decay = 0; I think using one of (dropout, weight decay) is enough.

@codertimo
Owner

codertimo commented Oct 30, 2018

@wangwei7175878 WOW, this is brilliant; it's a really huge step for us. Thank you for your effort and computing resources. Is there any result that used the default weight_decay? And can you share the full log as a file?

Original corpus

How did you get the original corpus? I tried very hard to get it, but I failed... I even sent an email to the authors asking for the original corpus, but failed. If possible, can you share the original corpus so that I can test the real performance?

@briandw

briandw commented Oct 30, 2018

Hi there,
I trained the model on a big dataset (wiki 2500M + BookCorpus 800M, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%.

@wangwei7175878 Can you share your pre-trained model? I'm really looking forward to trying this out, but I don't have that kind of processing power.

Thank you for your efforts.

@wangwei7175878

wangwei7175878 commented Oct 30, 2018

@codertimo The model can't converge using weight_decay = 0.01. My dataset is not exactly the original corpus, but I think it is almost the same. The wiki data can easily be downloaded from https://dumps.wikimedia.org/enwiki/ and you need a web spider to get BookCorpus from https://www.smashwords.com/

@wangwei7175878

@briandw My pre-trained model failed on downstream tasks (the fine-tuned model can't converge). I will share the pre-trained model once it works.

@wangwei7175878

@codertimo Here is the whole log. It took me almost one week to train about 250,000 steps. The accuracy seems to be stuck at 91%, whereas the original paper reports 98%.
log_run2_hhh_all_data_next_weight_1_no_decay.txt

@codertimo
Owner

@wangwei7175878 Can you share your crawling and preprocessing code in the above issue? Or, if possible, can you share the full corpus via a shared drive (Dropbox, Google Drive, etc.)? This would be really helpful to us.

@codertimo
Owner

@wangwei7175878 Very interesting; the authors said 0.01 weight decay is the default parameter they used. What are your parameter settings? Are they the same as our code's defaults except for weight_decay?

@wangwei7175878

Hi there,
I believe I figured out why the model can't converge with weight_decay = 0.01. Following OpenAI's code here:
I think BERT used AdamW instead of Adam.
After rewriting this Adam code in PyTorch, my model can now converge with the default settings.
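
For context, a minimal sketch of the decoupled weight decay idea (an AdamW-style update written for illustration; it is not the code from the eventual pull request, and recent PyTorch versions ship this as torch.optim.AdamW):

import torch

def adamw_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    # One AdamW-style update on a single tensor (illustrative helper).
    beta1, beta2 = betas
    # Update biased first- and second-moment estimates.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias-corrected Adam step computed from the gradient only.
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    param.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)
    # Decay the weights directly, decoupled from the gradient-based step;
    # plain Adam with L2 regularization instead folds the decay into the gradient.
    param.mul_(1 - lr * weight_decay)
    return param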

@codertimo
Owner

@wangwei7175878 Sounds great!
Can you make a pull request with your AdamW implementation?
I'll test it on my corpus too 👍

@waynedane

I used my corpus; after three epochs, the accuracy rate is 73.54%. I set weight_decay = 0; the other parameters are the defaults. Training continues.

@shionhonda
Copy link

Just for your reference:
I also confirmed the accuracy increase following @Kosuke-Szk's suggestion.
[plots: loss and accuracy curves]

Though the model was resized to a really small one due to a memory limitation (< 12 GB), it still worked.
The hyperparameters were:

hidden=240 #768
layers=3 #12
attn_heads=3 #12
seq_len=30 # 60
batch_size=8 #32
epochs=10
num_workers=4 #5
with_cuda=True
log_freq=20
corpus_lines=None
lr=1e-3
adam_weight_decay=0.00
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0
min_freq=20 #7

I used 13 GB of English Wikipedia corpus with a vocabulary size of 775k.
But I stopped the job at just 2% progress through the first epoch, because it said it would take thousands of hours.

@zheolong

zheolong commented Jan 16, 2019

Hi there,
I trained the model on a big dataset (wiki 2500M + BookCorpus 800M, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%.
[screenshot: training log]
I set weight decay = 0; I think using one of (dropout, weight decay) is enough.

I need your machine, system, and GPU configuration, thanks.

And I've also built the wiki + BookCorpus dataset; I will publish the docs to help with reconstruction.

@zheolong

@shionhonda How do you print the accuracy every few global steps and finally create that curve?

@shionhonda

@zheolong
The loss and accuracy are exactly what data_iter prints to the console in pretrain.py.
Insert the following code there and plot the output (a plotting sketch follows below):

with open(FILENAME, 'a') as f:
    f.write('%d,%f,%f\n' %(i, avg_loss/(i+1), total_correct/total_element*100))
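
A possible way to plot the logged values afterwards (a sketch; the filename is hypothetical, and the column order follows the write above):

import matplotlib.pyplot as plt
import pandas as pd

FILENAME = "pretrain_log.csv"  # hypothetical path: whatever you logged to above

df = pd.read_csv(FILENAME, names=["step", "avg_loss", "acc"])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(df["step"], df["avg_loss"])
ax1.set_title("avg_loss")
ax2.plot(df["step"], df["acc"])
ax2.set_title("accuracy (%)")
plt.show()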

YongWookHa added a commit to YongWookHa/BERT-pytorch that referenced this issue Jun 3, 2019
@scuhz

scuhz commented Jul 23, 2020

Oh my god! I have no idea about this. I still get avg_acc = 50, even after applying the methods in this issue.
