Context across sentences, by mistake? #13

Open
joelb-git opened this issue Jun 22, 2017 · 3 comments

joelb-git commented Jun 22, 2017

SortVocab is removing the sentence end marker "</s>" from index 0 in
the vocab. I think the intent of the original word2vec code is that
newlines are replaced with the "</s>" token, which sits at index 0 in
the vocab, so that context does not cross sentences. Because of this
problem, however, looking up "</s>" returns -1 (an OOV word), and each
"sentence" ends up filling the maximum 1000-word buffer.
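
To make the effect concrete, here is a minimal sketch of how a sentence buffer gets filled from a stream of vocab indices. It is an illustration under stated assumptions, not the repository's code: `next_sentence` and the constant names are hypothetical, and only the values 0 (the "</s>" index), -1 (OOV) and 1000 (buffer size) come from the description above.

```c
#include <stddef.h>

#define MAX_SENTENCE_LENGTH 1000  /* buffer size mentioned above */
#define END_OF_SENTENCE 0         /* expected vocab index of "</s>" */
#define OOV (-1)                  /* lookup result for an unknown word */

/* Hypothetical helper: copy the next sentence from `ids` into `sen`,
 * starting at *pos. OOV tokens are skipped. The sentence ends at
 * END_OF_SENTENCE, at the end of input, or when the buffer is full.
 * Returns the number of word indices stored. */
size_t next_sentence(const long long *ids, size_t n, size_t *pos,
                     long long *sen) {
  size_t len = 0;
  while (*pos < n && len < MAX_SENTENCE_LENGTH) {
    long long w = ids[(*pos)++];
    if (w == OOV) continue;           /* unknown word: drop it */
    if (w == END_OF_SENTENCE) break;  /* sentence boundary: stop here */
    sen[len++] = w;
  }
  return len;
}
```

With the stream `{1, 2, 0, 3, 4}` two calls each yield a two-word sentence. If the marker is looked up as -1 instead, the stream becomes `{1, 2, -1, 3, 4}` and a single call yields one four-word "sentence", so context windows span the original boundary.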

I added printf statements before and after the call to SortVocab and
ran on trivial input to demonstrate.

[~/views/word2vec (master *)]
$ git log | head -1
commit 80be14a89b260df5cfca19a65cbfe52ba15db7ba

$ git diff
diff --git a/src/word2vec.c b/src/word2vec.c
index 2f892ea..7bd6392 100644
--- a/src/word2vec.c
+++ b/src/word2vec.c
@@ -309,7 +309,11 @@ void LearnVocabFromTrainFile() {
     } else vocab[i].cn++;
     if (vocab_size > vocab_hash_size * 0.7) ReduceVocab();
   }
+
+  printf("before: </s> index = %d\n", SearchVocab("</s>"));
   SortVocab();
+  printf("after:  </s> index = %d\n", SearchVocab("</s>"));
+
   if (debug_mode > 0) {
     printf("Vocab size: %lld\n", vocab_size);
     printf("Words in train file: %lld\n", train_words);

$ make -C src
...
$ echo foo bar baz >in.txt
$ bin/word2vec -train in.txt
Starting training using file in.txt
before: </s> index = 0
after:  </s> index = -1   <------- oops!
Vocab size: 1
Words in train file: 0

I also verified that the original word2vec code did not have this
problem.
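
For comparison, a sketch of the invariant the report expects to hold after sorting: entry 0 stays in place and is never pruned, whatever its count. This is an assumption-laden illustration, not a patch for this repository. The struct is a simplified stand-in, and `sort_vocab_keep_marker` and `search_vocab` are hypothetical names; it does not attempt to identify where the code here diverges.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a vocab entry (illustrative fields only). */
struct vocab_word {
  long long cn;      /* occurrence count */
  const char *word;
};

/* Order entries by descending count. */
static int vocab_compare(const void *a, const void *b) {
  long long ca = ((const struct vocab_word *)a)->cn;
  long long cb = ((const struct vocab_word *)b)->cn;
  return (cb > ca) - (cb < ca);
}

/* Sort entries 1..size-1 by frequency, leaving entry 0 ("</s>") where
 * it is, then cut off entries below min_count. Entry 0 is always kept.
 * Returns the new vocabulary size. */
size_t sort_vocab_keep_marker(struct vocab_word *vocab, size_t size,
                              long long min_count) {
  size_t kept = 1;
  if (size == 0) return 0;
  qsort(&vocab[1], size - 1, sizeof(struct vocab_word), vocab_compare);
  /* Entries 1.. are now in descending order, so stop at the first rare one. */
  while (kept < size && vocab[kept].cn >= min_count) kept++;
  return kept;
}

/* Linear lookup over the first `size` entries; returns -1 when absent. */
long long search_vocab(const struct vocab_word *vocab, size_t size,
                       const char *word) {
  for (size_t i = 0; i < size; i++)
    if (strcmp(vocab[i].word, word) == 0) return (long long)i;
  return -1;
}
```

On a vocab mirroring the trivial input above ("</s>", foo, bar, baz, each with a small count) and a minimum count of 5, the returned size is 1 and looking up "</s>" still gives index 0, which is the "after" value the printf check was expected to show.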


Simsso commented May 28, 2019

Hey @joelb-git, just discovered that. Did you find out more in the meantime?

@joelb-git

Hi @Simsso - no, I had no response on this. This was a while ago. I think I ended up just using the original code instead, at https://github.com/tmikolov/word2vec.git


Simsso commented May 28, 2019

Thx, will do the same!
