pred_loss decrease fast while avg_acc stay at 50% #32
I also met the same problem on a small dataset. |
me too |
Hmm, interesting... Is this the result of the 0.0.1a4 version? |
0.0.1a3 version |
I tried using 0.0.1a4 and the result is the same. |
Hmmm... anyone have any clues? |
Probably the criterion loss function is the problem.

```python
import torch
import torch.nn as nn

# shape [10, 2]: log-probabilities from a not-very-accurate next-sentence head
out = torch.tensor([[ -8.4014, -0.0002],
                    [-10.3151, -0.0000],
                    [ -8.8440, -0.0001],
                    [ -7.5148, -0.0005],
                    [-11.0145, -0.0000],
                    [-10.9770, -0.0000],
                    [-13.3770, -0.0000],
                    [ -9.5733, -0.0001],
                    [ -9.5957, -0.0001],
                    [ -9.0712, -0.0001]])

# shape [10]: next-sentence labels (0 = not next, 1 = is next)
label = torch.tensor([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])

original_criterion = nn.NLLLoss(ignore_index=0)  # drops every label-0 example from the loss
criterion = nn.NLLLoss()

original_loss = original_criterion(out, label)
loss = criterion(out, label)
```

With the above code snippet, original_loss is 0.0002 while loss is 5.0005: `ignore_index=0` silently ignores all "not next" examples in the next-sentence loss, so the model is never penalized for predicting "is next" everywhere. I changed the following code in

to:

And as the magnitude of next_loss is smaller than mask_loss, I also over-weighted next_loss, and got 58% next-sentence accuracy after training on my corpus for one epoch. |
That's right. I just figured it out. Also note that for the masked LM, we still need |
@cairoHy Wow, thank you for your smart analysis. I just fixed this issue on the 0.0.1a5 version branch. The changes are here:
BERT-pytorch/bert_pytorch/trainer/pretrain.py Lines 61 to 62 in 2a0b282
BERT-pytorch/bert_pytorch/trainer/pretrain.py Lines 98 to 102 in 2a0b282
|
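For readers who can't view the permalinks above, the fix described in this thread amounts to roughly the following: keep `ignore_index=0` for the masked-LM loss (where 0 marks padded/unlabelled positions) but drop it for the next-sentence loss (where 0 is a real "not next" label). This is a minimal sketch with illustrative names, assuming both heads output log-softmax probabilities; it is not the repository's exact code.

```python
import torch.nn as nn

# Token-level masked-LM loss: label 0 marks positions that should not contribute.
masked_lm_criterion = nn.NLLLoss(ignore_index=0)
# Sentence-level next-sentence loss: label 0 ("not next") must be kept.
next_sent_criterion = nn.NLLLoss()

def pretrain_loss(mask_lm_output, token_labels, next_sent_output, is_next_label):
    # mask_lm_output: [batch, seq_len, vocab] log-probs; NLLLoss expects [batch, vocab, seq_len]
    mask_loss = masked_lm_criterion(mask_lm_output.transpose(1, 2), token_labels)
    next_loss = next_sent_criterion(next_sent_output, is_next_label)
    return mask_loss + next_loss
```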
Thanks to everyone who joined this investigation :) Additionally, can anyone test the new code with their own corpus?

```bash
git clone https://github.com/codertimo/BERT-pytorch.git
cd BERT-pytorch
git checkout 0.0.1a5
pip install -U .
```

Special thanks to @jiqiujia @cairoHy @NiHaoUCAS @wenhaozheng-nju |
@cairoHy after the modification, the model can't converge. Any suggestions? |
@jiqiujia Can you tell me more details, like a figure or logs? |
@codertimo The loss just doesn't converge. |
bert-small-25-logs.txt This is the result on my 1M corpus after 1 epoch; anyone is welcome to review it. |
@codertimo Could you please show your parameter settings? |
@yangze01 just default params with batch size 128 |
@codertimo I think this code has some errors: if len(t1) is longer than seq_len, bert_input will only contain t1, and segment_label will likewise only contain the segment labels of t1. |
I know, but the line length in my corpus is usually less than 10 for each sentence. |
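For reference, a common way to avoid losing the second segment when a pair exceeds seq_len is to truncate the two segments jointly, as in the original BERT preprocessing. A minimal sketch (the function name is illustrative, not this repository's code):

```python
def truncate_seq_pair(tokens_a, tokens_b, max_len):
    """Trim the longer segment one token at a time until the pair fits,
    so that both t1 and t2 survive truncation."""
    while len(tokens_a) + len(tokens_b) > max_len:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
```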
@codertimo I think the next-sentence sampling has a serious bug. Suppose 'B' is the next sentence of 'A'; you may never sample a negative instance for 'A'. |
@wenhaozheng-nju I did negative sampling BERT-pytorch/bert_pytorch/dataset/dataset.py Lines 92 to 99 in 0d076e0
BERT-pytorch/bert_pytorch/dataset/dataset.py Lines 114 to 125 in 0d076e0
|
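For context, the usual strategy, and roughly what the linked dataset code is described as doing in this thread, is to keep the true next sentence half the time and otherwise substitute a random sentence from the corpus. A hedged sketch with illustrative names, not the repository's exact code:

```python
import random

def random_sent(t1, t2, corpus_lines):
    """Return (first sentence, second sentence, is_next label).
    With probability 0.5 keep the real next sentence (label 1);
    otherwise pick a random sentence from the corpus (label 0)."""
    if random.random() > 0.5:
        return t1, t2, 1
    return t1, random.choice(corpus_lines), 0
```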
@codertimo Suppose the dataset is: |
@wenhaozheng-nju hmmm but do you think it's the main problem of this issue? I guess it's a model problem. |
@codertimo Yes, the model should sample a positive and a negative instance for each sentence in the sentence-pair classification problem. I think the two tasks are the same. |
@wenhaozheng-nju Then do you think that if I change the negative sampling code as you suggested, this issue could be resolved? |
@codertimo I think everyone here wants to solve the problem; calm down and let's focus on the issue. @wenhaozheng-nju If you think that's the problem, you can try modifying the code and running it (but I don't think it's the main problem; random negative sampling is a commonly used strategy). |
@jiqiujia I trained on my dataset for 10 hours last night, with dropout rate 0.0 (which is the same as no dropout) and dropout rate 0.1. Unfortunately, neither test loss converged. |
@jiqiujia Could you share more details? I trained with 1,000,000 samples, seq_len 64, vocab_size 100,000, dropout 0, but the result is the same as before. |
I've tried varying some parameters and it seems that on my dataset these parameters don't have much impact; only dropout is critical. But my dataset is rather small; I chose a small dataset just to debug. I will try some larger datasets. Hope it's helpful. You're welcome to share your experiments. |
And this is roughly the whole training log. The accuracy seems to be stuck at 81% finally. |
@Kosuke-Szk Thank you for sharing your result with us. |
@wangwei7175878 WOW, this is brilliant; this is a really huge step for us. Thank you for your effort and computation resources. Is there any result that used the original corpus? How did you get the original corpus? I tried very hard to get it, but I failed... I even sent an email to the authors to ask for the original corpus, but failed. If possible, can you share the original corpus, so that I can test the real performance? |
@wangwei7175878 Can you share your pre-trained model? I'm really looking forward to trying this out, but I don't have that kind of processing power. Thank you for your efforts. |
@codertimo The model can't converge using weight_decay = 0.01. My dataset is not exactly the original corpus, but I think it is almost the same. Wiki data can easily be downloaded from https://dumps.wikimedia.org/enwiki/ and you need a web spider to get BooksCorpus from https://www.smashwords.com/ |
@briandw My pre-trained model failed on downstream tasks (the fine-tuned model can't converge). I will share the pre-trained model once it works. |
@codertimo Here is the whole log. It took me almost one week to train for about 250,000 steps. The accuracy seems to be stuck at 91%, whereas it is reported as 98% in the original paper. |
@wangwei7175878 Can you share your code for crawling and preprocessing from the above issue? Or, if possible, can you share the full corpus via a shared drive (Dropbox, Google Drive, etc.)? This would be really helpful to us. |
@wangwei7175878 Very interesting; the authors said 0.01 weight decay is the default parameter they used. What are your parameter settings? Are they the same as the default settings of our code except for weight_decay? |
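One possible explanation, not confirmed anywhere in this thread: the reference BERT implementation applies weight decay in decoupled (AdamW-style) form and only to weight matrices, skipping biases and LayerNorm parameters. A minimal sketch of that parameter grouping (illustrative names; this repository's optimizer setup may differ):

```python
from torch.optim import AdamW

def build_optimizer(model, lr=1e-4, weight_decay=0.01):
    # Decay only weight matrices; biases and LayerNorm parameters are
    # conventionally excluded from weight decay in BERT-style training.
    skip_keys = ("bias", "norm")
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        (no_decay if any(k in name.lower() for k in skip_keys) else decay).append(param)
    return AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )
```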
Hi there, |
@wangwei7175878 Sounds Great! |
I used my own corpus; after three epochs, the accuracy is 73.54%. I set weight_decay = 0. The other parameters are the defaults. Training continues. |
Just for your reference: though the model was resized to a really small one due to memory limitations (< 12 GB), it still worked.
I used a 13 GB English Wikipedia corpus with a vocabulary size of 775k. |
@shionhonda How do you print the accuracy every few global steps and then create that curve? |
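One simple way to do this (not the repository's built-in logging; the file name and interval are illustrative) is to append avg_acc to a CSV every N global steps and plot the series after training:

```python
import csv
import matplotlib.pyplot as plt

LOG_EVERY = 100  # illustrative logging interval

def log_accuracy(step, avg_acc, path="acc_log.csv"):
    # Append (step, accuracy) so the curve can be plotted later.
    if step % LOG_EVERY == 0:
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([step, avg_acc])

def plot_accuracy(path="acc_log.csv"):
    with open(path) as f:
        rows = [(int(s), float(a)) for s, a in csv.reader(f)]
    steps, accs = zip(*rows)
    plt.plot(steps, accs)
    plt.xlabel("global step")
    plt.ylabel("next-sentence accuracy (%)")
    plt.savefig("acc_curve.png")
```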
Oh my god! I have no idea about this. I get the same result, avg_acc = 50, even after following the methods in this issue. |
I tried to run the code on a small dataset and I found that pred_loss decreases fast while avg_acc stays at 50%. This is strange to me, since a decrease in pred_loss should indicate an increase in accuracy.
