Commit 98aa1d8

DongjiGao authored and danpovey committed
[egs] fixed bug in egs/gale_arabic/s5c/local/prepare_dict_subword.sh that it may delete words matching '<*>' (#3465)
1 parent b5385b4 commit 98aa1d8

1 file changed, +1 -1 lines changed

egs/gale_arabic/s5c/local/prepare_dict_subword.sh

@@ -48,7 +48,7 @@ glossaries="<UNK> <sil>"
 if [ $stage -le 0 ]; then
   echo "$0: making subword lexicon... $(date)."
   # get pair_code file
-  cut -d ' ' -f2- data/train/text | sed 's/<[^>]*>//g' | utils/lang/bpe/learn_bpe.py -s $num_merges > data/local/pair_code.txt
+  cut -d ' ' -f2- data/train/text | sed 's/<sil>//g;s/<UNK>//g' | utils/lang/bpe/learn_bpe.py -s $num_merges > data/local/pair_code.txt
   mv $dir/lexicon.txt $dir/lexicon_word.txt
   # get words
   cut -d ' ' -f1 $dir/lexicon_word.txt > $dir/words.txt
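
The effect of the change can be seen by running both sed expressions on the same input. The following is a minimal sketch, not part of the commit: the sample line and the `<music>` token are hypothetical stand-ins for a transcript line containing a word of the form `<...>` that is not one of the two glossary symbols.

```shell
# Hypothetical transcript text; "<music>" stands in for any word
# of the form <...> other than the glossary symbols <UNK> and <sil>.
sample='hello <UNK> world <music> foo <sil>'

# Old filter: deletes every token of the form <...>.
old_out=$(printf '%s\n' "$sample" | sed 's/<[^>]*>//g')

# New filter: deletes only <sil> and <UNK>.
new_out=$(printf '%s\n' "$sample" | sed 's/<sil>//g;s/<UNK>//g')

printf 'old: [%s]\n' "$old_out"
printf 'new: [%s]\n' "$new_out"
```

With the old expression `<music>` is deleted along with `<UNK>` and `<sil>`, so it would not be passed on to learn_bpe.py; with the new expression it is kept.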
