SortVocab is removing the sentence-end marker "</s>" from index 0 of the vocab. I believe the intent of the original word2vec code is that newlines are replaced with the "</s>" token, which is always found at index 0 in the vocab, so that context never crosses sentence boundaries. Because of this bug, however, looking up "</s>" actually returns -1 (an OOV word), and each "sentence" ends up filling the maximum 1000-word buffer.
I added printf statements before and after the call to SortVocab and ran on a trivial input to demonstrate:
[~/views/word2vec (master *)]
$ git log | head -1
commit 80be14a89b260df5cfca19a65cbfe52ba15db7ba
$ git diff
diff --git a/src/word2vec.c b/src/word2vec.c
index 2f892ea..7bd6392 100644
--- a/src/word2vec.c
+++ b/src/word2vec.c
@@ -309,7 +309,11 @@ void LearnVocabFromTrainFile() {
} else vocab[i].cn++;
if (vocab_size > vocab_hash_size * 0.7) ReduceVocab();
}
+
+ printf("before: </s> index = %d\n", SearchVocab("</s>"));
SortVocab();
+ printf("after: </s> index = %d\n", SearchVocab("</s>"));
+
if (debug_mode > 0) {
printf("Vocab size: %lld\n", vocab_size);
printf("Words in train file: %lld\n", train_words);
$ make -C src
...
$ echo foo bar baz >in.txt
$ bin/word2vec -train in.txt
Starting training using file in.txt
before: </s> index = 0
after: </s> index = -1 <------- oops!
Vocab size: 1
Words in train file: 0
I also verified that the original word2vec code did not have this
problem.