[REVIEW] Add tfidf bm25 #2353
base: branch-24.12
Is there a way to remove all these imports?
Ideally you just import what you need, so if you need all of these then go ahead and import them. Otherwise, try to remove things that are unneeded.
Created these structs to condense the preprocessing logic into a single map call.
Ended up using the thrust version because it can handle vectors, which lets me use the same code for both the CSR and COO matrix versions of the encoding logic. Also, the raft version does not support sparse matrices.
Is this to compute the degree of each row in the sparse format? We already have routines for this, e.g. the `coo_degree` function. Degree computation for CSR is trivial: since you already have an array of offsets, you don't even need to count the columns, because you can just diff the array (i.e. compute the difference between each value in the indptr array and the value before it). If you can't guarantee uniqueness, you can use a simple mask as an efficient way to flag unique entries. For COO, you can then add up the 1s in the mask for each row segment. For a sorted COO, the degree computation is also trivial: you only need the rows and columns arrays and a segmented reduce.
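To make the suggestion concrete, here is a minimal host-side sketch in plain Python (not raft or CUDA code). The function names are hypothetical and chosen for illustration only. It shows the CSR diff, a plain COO count, and the uniqueness-mask variant for a COO that is assumed to be sorted by (row, column).

```python
from collections import Counter


def csr_row_degrees(indptr):
    """Row degrees of a CSR matrix: adjacent differences of the offsets array."""
    return [indptr[i + 1] - indptr[i] for i in range(len(indptr) - 1)]


def coo_row_degrees(rows, n_rows):
    """Row degrees of a COO matrix: count how often each row index occurs."""
    counts = Counter(rows)
    return [counts.get(r, 0) for r in range(n_rows)]


def sorted_coo_unique_degrees(rows, cols, n_rows):
    """Row degrees counting each (row, col) pair once.

    Assumes the COO entries are sorted by (row, col), so duplicates are
    adjacent and an entry is unique exactly when it differs from its
    predecessor. Adding 1 per unique entry is the per-row segmented sum
    of that mask.
    """
    degrees = [0] * n_rows
    previous = None
    for entry in zip(rows, cols):
        if entry != previous:
            degrees[entry[0]] += 1
        previous = entry
    return degrees


# A 3-row matrix: row 0 holds 2 entries, row 1 none, row 2 holds 3.
indptr = [0, 2, 2, 5]
rows = [0, 0, 2, 2, 2]
cols_with_duplicates = [1, 1, 0, 3, 3]
```

On a GPU the same three steps would map to an adjacent difference, a histogram or atomic count, and a segmented reduce over the mask, respectively.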
When we were using this function for rows, `coo_degree` was absolutely the right play. I was just trying to follow code reuse, but that ended up causing problems with larger datasets (in the form of illegal memory access errors). I have changed it so this function is only used when we need a column-wise sum of the values (not just a check for whether a value exists, as with rows). And we can't just use L1 normalization, because I need the avg column size across all columns as well as the individual column avg. The `reduce by key` functions available in raft are for dense matrices only, which is why I have opted to use the thrust `reduce_by_key` for the column-based processing.
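For readers unfamiliar with the pattern, the column-wise sum can be sketched in plain Python. This is only an analogy for the reduce-by-key technique, not the thrust call itself: `itertools.groupby` likewise collapses runs of consecutive equal keys, so the keys have to be sorted first. The helper names are hypothetical, and taking the mean of the per-column sums is an assumption about what the overall average refers to.

```python
from itertools import groupby
from operator import itemgetter


def reduce_by_key(keys, values):
    """Sum each run of consecutive equal keys (keys must already be sorted)."""
    out_keys, out_sums = [], []
    for key, group in groupby(zip(keys, values), key=itemgetter(0)):
        out_keys.append(key)
        out_sums.append(sum(value for _, value in group))
    return out_keys, out_sums


def column_sums(cols, vals, n_cols):
    """Column-wise sums of the stored values of a COO matrix."""
    # Sort the nonzeros by column index so equal keys become adjacent.
    order = sorted(range(len(cols)), key=cols.__getitem__)
    keys, sums = reduce_by_key([cols[i] for i in order],
                               [vals[i] for i in order])
    # Scatter into a dense vector; columns with no entries stay at 0.0.
    dense = [0.0] * n_cols
    for key, total in zip(keys, sums):
        dense[key] = total
    return dense


cols = [1, 0, 1, 3, 0]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
per_column = column_sums(cols, vals, n_cols=4)
overall_average = sum(per_column) / len(per_column)
```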
This broadcasts the values out to the correct vector positions so that they match the correct row/column indices.
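As a rough illustration of that broadcast step (a plain Python sketch with hypothetical names, not the PR's kernel): each stored nonzero looks up the per-row or per-column statistic through its own index, which is a gather.

```python
def broadcast_to_nnz(per_index_values, indices):
    """Gather one per-row (or per-column) value for every stored nonzero."""
    return [per_index_values[i] for i in indices]


# COO indices of five nonzeros in a 3x4 matrix.
rows = [0, 0, 2, 2, 2]
cols = [1, 3, 0, 1, 3]

row_lengths = [2, 0, 3]       # one value per row
col_counts = [1, 2, 0, 2]     # one value per column

row_lengths_per_nnz = broadcast_to_nnz(row_lengths, rows)
col_counts_per_nnz = broadcast_to_nnz(col_counts, cols)
```

After this step every per-nonzero array has the same length as the values array, so a single elementwise map can combine them.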
Might be better to create a single encode function and pass the desired struct in as a function parameter that can be relayed to the map call.
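One way that could look, sketched in plain Python rather than CUDA: a single `encode` takes the scoring functor as a parameter and relays it to one map over the nonzeros. Everything here is hypothetical (the class names, the argument order, the `k1` and `b` defaults), and the formulas are common textbook variants of TF-IDF and BM25, not necessarily the ones this PR implements.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TfIdf:
    """Textbook TF-IDF weight: tf * log(n_docs / doc_freq)."""
    n_docs: int

    def __call__(self, tf, doc_freq, doc_len, avg_doc_len):
        # doc_len and avg_doc_len are unused; kept so both functors share a signature.
        return tf * math.log(self.n_docs / doc_freq)


@dataclass(frozen=True)
class Bm25:
    """Textbook BM25 weight with length normalization (k1 and b are assumed defaults)."""
    n_docs: int
    k1: float = 1.2
    b: float = 0.75

    def __call__(self, tf, doc_freq, doc_len, avg_doc_len):
        idf = math.log(self.n_docs / doc_freq)
        norm = tf + self.k1 * (1 - self.b + self.b * doc_len / avg_doc_len)
        return idf * tf * (self.k1 + 1) / norm


def encode(values, doc_freqs, doc_lens, avg_doc_len, scorer):
    """Single entry point: one map over the nonzeros, parameterized by the scorer."""
    return [scorer(tf, df, dl, avg_doc_len)
            for tf, df, dl in zip(values, doc_freqs, doc_lens)]


# Per-nonzero inputs, already broadcast to the positions of the stored values.
values = [1.0, 2.0]
doc_freqs = [1, 2]
doc_lens = [3, 3]

tfidf_weights = encode(values, doc_freqs, doc_lens, 3.0, TfIdf(n_docs=2))
bm25_weights = encode(values, doc_freqs, doc_lens, 3.0, Bm25(n_docs=2))
```

The CSR and COO code paths would then differ only in how they prepare the per-nonzero inputs, not in the encode call.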