SortVocab is removing the sentence-end marker "</s>" from index 0 of the vocab. I believe the intent of the original word2vec code is that newlines are replaced with the "</s>" token, which is always found at index 0 in the vocab, so that context never crosses sentence boundaries. Because of this bug, however, looking up "</s>" actually returns -1 (an OOV word), and each "sentence" ends up filling the maximum 1000-word buffer.
I added printf statements before and after the call to SortVocab and ran on a trivial input to demonstrate:
[~/views/word2vec (master *)]
$ git log | head -1
commit 80be14a89b260df5cfca19a65cbfe52ba15db7ba
$ git diff
diff --git a/src/word2vec.c b/src/word2vec.c
index 2f892ea..7bd6392 100644
--- a/src/word2vec.c
+++ b/src/word2vec.c
@@ -309,7 +309,11 @@ void LearnVocabFromTrainFile() {
} else vocab[i].cn++;
if (vocab_size > vocab_hash_size * 0.7) ReduceVocab();
}
+
+ printf("before: </s> index = %d\n", SearchVocab("</s>"));
SortVocab();
+ printf("after: </s> index = %d\n", SearchVocab("</s>"));
+
if (debug_mode > 0) {
printf("Vocab size: %lld\n", vocab_size);
printf("Words in train file: %lld\n", train_words);
$ make -C src
...
$ echo foo bar baz >in.txt
$ bin/word2vec -train in.txt
Starting training using file in.txt
before: </s> index = 0
after: </s> index = -1 <------- oops!
Vocab size: 1
Words in train file: 0
I also verified that the original word2vec code did not have this
problem.