
Shall I use cosine or dot product to calculate the similarity between user & item? #17

Open
jackyhawk opened this issue May 30, 2018 · 11 comments

Comments

@jackyhawk

Shall I use cosine or dot product to calculate the similarity between user & item?

(Since there would be some negative latent factors for user & item, is cosine still suitable?)

Thanks very much

@chtran

chtran commented May 30, 2018

Hi, if you care only about similarity, not popularity, you can use cosine similarity. If you care about the popularity of items, you can use the dot product.
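A minimal sketch of the difference (plain NumPy, not qmf code; the factor vectors are made up for illustration):

```python
import numpy as np

def dot_score(u, v):
    # Dot product: sensitive to vector norms, so items whose factors
    # have larger norms (often popular items) score higher.
    return float(np.dot(u, v))

def cosine_score(u, v):
    # Cosine: norms are divided out, so only the direction (taste match)
    # matters, not the magnitude.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

user = np.array([0.5, -0.2, 0.8])
item_popular = 2.0 * user    # same direction, larger norm
item_niche = 0.5 * user      # same direction, smaller norm

# Dot product ranks the larger-norm item higher; cosine scores both 1.0.
print(dot_score(user, item_popular), dot_score(user, item_niche))
print(cosine_score(user, item_popular), cosine_score(user, item_niche))
```

Negative factor values (as in the original question) are fine for cosine; the similarity simply ranges over [-1, 1] rather than [0, 1].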

@jackyhawk jackyhawk changed the title Shall I user consine or dot product to calculate the similarity between user & item? Shall I use consine or dot product to calculate the similarity between user & item? May 31, 2018
@jackyhawk

jackyhawk commented May 31, 2018

Thanks very much, chtran.

And what's more, shall I use the item bias when I calculate the score?

Would the result be like this:

score = user_factor dot_product item_factor + item_bias

or just:

score = user_factor dot_product item_factor

?

@jackyhawk

And btw, it seems that qmf does not support user biases?

@jackyhawk

Any suggestions for this? Thanks very much

@albietz

albietz commented Jun 1, 2018

Hi @jackyhawk,

Yes, if you train your model with item biases, you should also use them when making predictions.
qmf does not have user biases because the models try to predict preferences/rankings for each user, as opposed to absolute scores (like ratings in the Netflix dataset), so adding a user offset does not change anything.

-Alberto
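The point about user biases can be checked numerically. In a small sketch (random vectors, not qmf code), a per-user bias is a constant added to every item's score for that user, so it shifts all scores equally and leaves the item ranking unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
user = rng.normal(size=8)           # one user's latent factors
items = rng.normal(size=(5, 8))     # five items' latent factors
item_bias = rng.normal(size=5)

# Prediction with item biases (no user bias).
scores = items @ user + item_bias

# A hypothetical user bias adds the same constant to every item's score...
scores_with_user_bias = scores + 3.7

# ...so the induced item ranking for this user is identical.
ranking = np.argsort(-scores)
ranking_with_user_bias = np.argsort(-scores_with_user_bias)
print(ranking)
print(ranking_with_user_bias)
```

Item biases, by contrast, differ per item, so they do change the ranking and must be kept at prediction time if they were used in training.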

@jackyhawk

jackyhawk commented Jun 1, 2018

Thanks very much, albietz.

And would it be like this:
score = user_factor dot_product item_factor + item_bias
or
score = cosine(user_factor, item_factor) + item_bias
?

And I've just pasted the test metrics of my real training output (about 5 million users, 0.5 million items, and about 0.5 billion clicks, which are our users' behavior within the latest 30 days, after filtering some outliers).
It seems that the AUC is very high, but the precision and recall (@10) are very low.
Is this a normal scenario?

And what about the train loss? It seems that a train loss between 0.05 and 0.08 works for me, but the test loss (0.223691) is not good.

====================================================
test metrics:
train loss = 0.0759575, test loss = 0.223691

18:29:02.830051 26919 MetricsEngine.cpp:41] begin metrics: epoch 9: recorded metric test_avg_auc = 0.91341,log_:1
18:29:02.830128 26919 MetricsEngine.cpp:45] epoch 9: recorded metric test_avg_auc = 0.91341
18:29:11.645699 26919 MetricsEngine.cpp:41] begin metrics: epoch 9: recorded metric test_avg_ap = 0.00194532,log_:1
18:29:11.645777 26919 MetricsEngine.cpp:45] epoch 9: recorded metric test_avg_ap = 0.00194532
18:29:14.434798 26919 MetricsEngine.cpp:41] begin metrics: epoch 9: recorded metric test_avg_p@10 = 0.0016,log_:1
18:29:14.434895 26919 MetricsEngine.cpp:45] epoch 9: recorded metric test_avg_p@10 = 0.0016
18:29:17.357926 26919 MetricsEngine.cpp:41] begin metrics: epoch 9: recorded metric test_avg_r@10 = 0.00135481,log_:1
18:29:17.358002 26919 MetricsEngine.cpp:45] epoch 9: recorded metric test_avg_r@10 = 0.00135481
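For intuition about high AUC alongside tiny precision@10, here is a synthetic sketch (made-up scores, not the real data): when a user has only a handful of positives in a very large catalog, a model can rank positives well above a random negative on average (high AUC) while still rarely placing one in the absolute top 10.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 100_000   # large catalog (the real one above has 0.5M items)
n_pos = 5           # this user clicked only a handful of items

scores = rng.normal(size=n_items)
pos_idx = rng.choice(n_items, size=n_pos, replace=False)
scores[pos_idx] += 2.0   # positives score clearly higher on average

# AUC = probability that a random positive outranks a random negative.
neg_scores = np.delete(scores, pos_idx)
auc = float((scores[pos_idx][:, None] > neg_scores[None, :]).mean())

# Precision@10 = fraction of the top-10 items that are positives.
top10 = np.argsort(-scores)[:10]
p_at_10 = float(np.isin(top10, pos_idx).mean())

print(auc, p_at_10)
```

With 5 positives among 100,000 items, precision@10 is capped at 0.5 even for a perfect ranker, and for a merely good one it is typically 0, while AUC stays high.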

@jackyhawk

And what's more, as for the input data file, should the old data be at the beginning, or the new data? Thanks very much

@jackyhawk jackyhawk changed the title Shall I use consine or dot product to calculate the similarity between user & item? Shall I use cosine or dot product to calculate the similarity between user & item? Jun 2, 2018
@jackyhawk

jackyhawk commented Jun 2, 2018

And what's more, shall I use more data (a longer period of users' behavior) as the training dataset to improve the precision & recall @10?

And currently I just use 100 as the nfactors dimension; shall I also increase that value?

Thanks very much

@jackyhawk

Any suggestions for that? Thanks very much

@jackyhawk

I tried increasing the dimension of the latent factors (from 100 to 200 and then to 300), and it seemed that the results got better.

@jackyhawk

Any suggestions for this? Thanks very much
