
[REVIEW] Add tfidf bm25 #2353

Open
Wants to merge 97 commits into base: branch-24.12
Changes from 50 commits (of 97)

Commits
a6677ca
update master references
ajschmidt8 Jul 14, 2020
ad2d7d7
REL DOC Updates for main branch switch
mike-wendt Jul 16, 2020
e3c9344
Merge pull request #272 from rapidsai/branch-21.06
ajschmidt8 Jun 10, 2021
3b0a6d2
Merge pull request #321 from rapidsai/branch-21.08
ajschmidt8 Sep 16, 2021
309ea1a
REL v21.08.00 release
GPUtester Apr 6, 2022
3740998
Merge pull request #612 from rapidsai/branch-22.04
raydouglass Apr 6, 2022
e987ec8
REL v22.04.00 release
GPUtester Apr 6, 2022
0b55c32
Add `conda` compilers (#702)
ajschmidt8 Jun 7, 2022
229b9f8
update changelog
raydouglass Jun 7, 2022
0eded98
Merge pull request #708 from rapidsai/branch-22.06
raydouglass Jun 7, 2022
3e5a625
FIX update-version.sh
raydouglass Jun 7, 2022
ad50a7f
Merge pull request #709 from rapidsai/branch-22.06
raydouglass Jun 7, 2022
ed2c529
REL v22.06.00 release
GPUtester Jun 7, 2022
aae5e34
Merge pull request #782 from rapidsai/branch-22.08
raydouglass Aug 17, 2022
87a7d16
REL v22.08.00 release
GPUtester Aug 17, 2022
1de93ba
Merge pull request #908 from rapidsai/branch-22.10
raydouglass Oct 12, 2022
31ae597
REL v22.10.00 release
GPUtester Oct 12, 2022
08abc72
[HOTFIX] Update cuda-python dependency to 11.7.1 (#963)
cjnolet Nov 4, 2022
c6e6ce8
Merge pull request #988 from rapidsai/branch-22.10
raydouglass Nov 4, 2022
f7d2335
REL v22.10.01 release
GPUtester Nov 4, 2022
c16fa56
Merge pull request #1063 from rapidsai/branch-22.12
raydouglass Dec 8, 2022
9a716b7
REL v22.12.00 release
GPUtester Dec 8, 2022
60936ba
Merge pull request #1101 from rapidsai/branch-22.12
raydouglass Dec 14, 2022
a655c9a
REL v22.12.01 release
GPUtester Dec 14, 2022
9a66f42
Merge pull request #1250 from rapidsai/branch-23.02
raydouglass Feb 9, 2023
69dce2d
REL v23.02.00 release
raydouglass Feb 9, 2023
1467154
Merge pull request #1405 from rapidsai/branch-23.04
raydouglass Apr 12, 2023
7d1057e
REL v23.04.00 release
raydouglass Apr 12, 2023
dc800d6
REL v23.04.01 release
raydouglass Apr 21, 2023
520e12c
REL Merge pull request #1486 from rapidsai/branch-23.04
raydouglass May 3, 2023
f626bf1
Merge pull request #1549 from rapidsai/branch-23.06
raydouglass Jun 7, 2023
c931b61
REL v23.06.00 release
raydouglass Jun 7, 2023
af1515d
Merge pull request #1589 from rapidsai/branch-23.06
raydouglass Jun 12, 2023
9147c90
REL v23.06.01 release
raydouglass Jun 12, 2023
59ae9d6
Merge pull request #1636 from rapidsai/branch-23.06
raydouglass Jul 5, 2023
7dd2f6d
REL v23.06.02 release
raydouglass Jul 5, 2023
5797ef5
Merge pull request #1692 from rapidsai/branch-23.08
raydouglass Aug 9, 2023
e588d7b
REL v23.08.00 release
raydouglass Aug 9, 2023
51f52c1
Merge pull request #1863 from rapidsai/branch-23.10
raydouglass Oct 11, 2023
afdddfb
REL v23.10.00 release
raydouglass Oct 11, 2023
e9f9aa8
Merge pull request #2020 from rapidsai/branch-23.12
raydouglass Dec 6, 2023
599651e
REL v23.12.00 release
raydouglass Dec 6, 2023
9e2d627
REL Revert update-version.sh changes for release
raydouglass Dec 6, 2023
1143113
Merge pull request #2134 from rapidsai/branch-24.02
raydouglass Feb 12, 2024
698d6c7
REL v24.02.00 release
raydouglass Feb 12, 2024
e0d40e5
Merge pull request #2240 from rapidsai/branch-24.04
raydouglass Apr 10, 2024
fa44bcc
REL v24.04.00 release
raydouglass Apr 10, 2024
41938c4
Merge pull request #2341 from rapidsai/branch-24.06
raydouglass Jun 5, 2024
63a506d
REL v24.06.00 release
raydouglass Jun 5, 2024
427ea26
add in support for preprocessing with bm25 and tfidf
jperez999 Jun 5, 2024
ffbfbc7
add in test cases and header file
jperez999 Jun 6, 2024
2d82aca
add tfidf coo support
jperez999 Jun 25, 2024
dc01bc1
add in header for coo tfidf
jperez999 Jun 25, 2024
6f4745d
add bm25 test support coo in and refactor tfidf support
jperez999 Jun 26, 2024
987ff5e
add in long test for coo to csr convert test
jperez999 Jun 28, 2024
c46008c
remove unneeded print statement
jperez999 Jun 28, 2024
81bb89d
remove unneeded test
jperez999 Jun 28, 2024
ff1991f
add csr and coo matrix bfknn apis
jperez999 Jul 3, 2024
c593f4e
add knn to preprocess tests
jperez999 Jul 3, 2024
0febb55
all tests in place and refactor code
jperez999 Jul 4, 2024
6477cd4
add in cmake for test files
jperez999 Jul 4, 2024
c836ba8
adjust tests, coo now passes all checks
jperez999 Jul 4, 2024
ce8253e
csr and coo tests passing, refactor feature preprocessing
jperez999 Jul 6, 2024
442cd7a
refactor names to make more generic
jperez999 Jul 7, 2024
b1720c7
further refactor to feature and id variable names
jperez999 Jul 7, 2024
3365ec3
add documentation and refactor to use num rows and num cols from matrix
jperez999 Jul 8, 2024
06b6df2
update tests to reflect values given refactor
jperez999 Jul 8, 2024
034d2c5
add documentation
jperez999 Jul 8, 2024
04bb007
removed unnecessary imports and variables
jperez999 Jul 8, 2024
3747291
fix function docs to reflect behavior more correctly
jperez999 Jul 9, 2024
281a029
Merge branch 'branch-24.08' into add-tfidf-bm25
jperez999 Jul 10, 2024
3d66d4b
Update docs/source/contributing.md
jperez999 Jul 10, 2024
2b70436
Update .github/PULL_REQUEST_TEMPLATE.md
jperez999 Jul 10, 2024
84ffc8b
Update .github/PULL_REQUEST_TEMPLATE.md
jperez999 Jul 10, 2024
63607bd
Merge branch 'branch-24.08' into add-tfidf-bm25
jperez999 Jul 12, 2024
0f462a9
Merge branch 'branch-24.08' into add-tfidf-bm25
jperez999 Jul 31, 2024
1fc27f3
Merge branch 'branch-24.10' into add-tfidf-bm25
jperez999 Jul 31, 2024
82cfb1f
Merge branch 'branch-24.10' into add-tfidf-bm25
jperez999 Aug 7, 2024
1155609
Merge branch 'branch-24.10' into add-tfidf-bm25
jperez999 Aug 14, 2024
6302957
Merge branch 'branch-24.10' into add-tfidf-bm25
cjnolet Aug 29, 2024
05f4af2
fix preprocessing and make tests run on r random at generation
jperez999 Sep 11, 2024
a1e3a48
remove unnecessary imports
jperez999 Sep 11, 2024
44f3e1c
remove log for tf
jperez999 Sep 11, 2024
e25e2de
added more template changes
jperez999 Sep 11, 2024
187e148
Merge branch 'branch-24.10' into add-tfidf-bm25
jperez999 Sep 11, 2024
ec4e4a2
Merge branch 'branch-24.10' into add-tfidf-bm25
jperez999 Sep 18, 2024
e6d2c1c
remove excess thrust calls
jperez999 Sep 24, 2024
5120c97
add better comment on inputs for tests
jperez999 Sep 24, 2024
81e2a41
Merge branch 'add-tfidf-bm25' of https://github.com/jperez999/raft in…
jperez999 Sep 24, 2024
90373ab
Merge branch 'branch-24.10' into add-tfidf-bm25
jperez999 Sep 24, 2024
87a729c
fixed scale errors
jperez999 Sep 26, 2024
63576b0
remove vector based public apis
jperez999 Sep 26, 2024
c123acb
add in bfknn tests for csr and coo sparse matrices
jperez999 Sep 27, 2024
29f14d9
Merge branch 'branch-24.12' into add-tfidf-bm25
rhdong Sep 27, 2024
0ca6e10
Merge branch 'branch-24.12' into add-tfidf-bm25
jperez999 Oct 16, 2024
b000065
remove unused functions
jperez999 Oct 17, 2024
3507771
Merge branch 'add-tfidf-bm25' of https://github.com/jperez999/raft in…
jperez999 Oct 17, 2024
4 changes: 2 additions & 2 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -34,9 +34,9 @@ Here are some guidelines to help the review process go smoothly.
features or make changes out of the scope of those requested by the reviewer
(doing this just add delays as already reviewed code ends up having to be
re-reviewed/it is hard to tell what is new etc!). Further, please do not
rebase your branch on master/force push/rewrite history, doing any of these
rebase your branch on main/force push/rewrite history, doing any of these
causes the context of any comments made by reviewers to be lost. If
conflicts occur against master they should be resolved by merging master
conflicts occur against main they should be resolved by merging main
into the branch used for making the pull request.

Many thanks in advance for your cooperation!
215 changes: 215 additions & 0 deletions cpp/include/raft/sparse/matrix/detail/preprocessing.cuh
@@ -0,0 +1,215 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <raft/core/device_mdarray.hpp>
jperez999 (Author): Is there a way to remove all these imports?

(Member): Ideally you just import what you need, so if you need all of these then go ahead and import them. Otherwise, remove the ones that are unneeded.

#include <raft/core/device_span.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/math.hpp>
#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resource/thrust_policy.hpp>
#include <raft/core/resources.hpp>
#include <raft/sparse/neighbors/cross_component_nn.cuh>
#include <raft/sparse/op/sort.cuh>
#include <raft/sparse/selection/knn.cuh>

#include <thrust/fill.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

struct bm25 {
jperez999 (Author): Created these structs to condense the logic into a single map call for preprocessing.

bm25(int num_docs, float avg_doc_len, float k_param, float b_param)
{
total_docs = num_docs;
avg_doc_length = avg_doc_len;
k = k_param;
b = b_param;
}

template <typename T1>
float __device__ operator()(const T1& values, const T1& doc_length, const T1& num_docs_term_occ)
{
return raft::log<float>(total_docs / (1 + num_docs_term_occ)) *
((values * (k + 1)) / (values + k * (1 - b + b * (doc_length / avg_doc_length))));
}
float avg_doc_length;
int total_docs;
float k;
float b;
};

struct tfidf {
tfidf(int total_docs_param) { total_docs = total_docs_param; }

template <typename T1, typename T2>
float __device__ operator()(const T1& values, const T2& num_docs_term_occ)
{
return raft::log<float>(1 + values) * raft::log<float>(total_docs / (1 + num_docs_term_occ));
}
int total_docs;
};

template <typename T>
struct mapper {
mapper(raft::device_vector_view<const T> map) : map(map) {}

__host__ __device__ void operator()(T& value) const
{
const T& new_value = map[value];
if (new_value) {
value = new_value;
} else {
value = 0;
}
}

raft::device_vector_view<const T> map;
};

template <typename T1, typename T2>
void get_uniques_counts(raft::resources& handle,
raft::device_vector_view<T1, int64_t> sort_vector,
raft::device_vector_view<T1, int64_t> secondary_vector,
raft::device_vector_view<T2, int64_t> data,
raft::device_vector_view<T2, int64_t> itr_vals,
raft::device_vector_view<T1, int64_t> keys_out,
raft::device_vector_view<T2, int64_t> counts_out)
{
cudaStream_t stream = raft::resource::get_cuda_stream(handle);
raft::sparse::op::coo_sort(sort_vector.size(),
secondary_vector.size(),
data.size(),
sort_vector.data_handle(),
secondary_vector.data_handle(),
data.data_handle(),
stream);

thrust::reduce_by_key(raft::resource::get_thrust_policy(handle),
jperez999 (Author): Ended up using the thrust version because it can handle vectors, which lets me use the same code for both the CSR and COO matrix versions of the encoding logic. Also, the raft version does not support sparse matrix inputs.

cjnolet (Member), Sep 24, 2024: Is this to compute the degree of each row in the sparse format? We have routines for this already, e.g. a coo_degree function. Degree computation for CSR is actually trivial: since you already have an array of offsets, you don't even need to count the columns, because you can just diff the indptr array (compute the difference between each value and the value that occurred before it). If you can't guarantee uniqueness, you can use a simple mask as an efficient way to compute uniqueness; for COO, you can then add the 1s in the mask for each row segment. For a sorted COO, degree computation is also trivial: you only need the row and column arrays and a segmented reduce.

jperez999 (Author), Sep 27, 2024: When we were using this function for rows, coo_degree was absolutely the right play. I was just trying to follow code reuse, but that ended up causing problems with larger datasets (in the form of illegal memory access errors). I have made it so this function is only used when we need a column-wise sum of the values (not just checking whether a value exists, as with rows). And we can't just use L1 normalization, because I need both the average column size across all columns and the individual column sums. The reduce-by-key functions available in raft are for dense matrices only, which is why I have opted to use the thrust reduce_by_key for the column-based processing.

sort_vector.data_handle(),
sort_vector.data_handle() + sort_vector.size(),
itr_vals.data_handle(),
keys_out.data_handle(),
counts_out.data_handle());
}

template <typename T1, typename T2>
void create_mapped_vector(raft::resources& handle,
jperez999 (Author): This broadcasts the values out to the correct vector positions so that they match the correct row/column indexes.

raft::device_vector_view<T1, int64_t> origin,
raft::device_vector_view<T1, int64_t> keys,
raft::device_vector_view<T2, int64_t> counts,
raft::device_vector_view<T2, int64_t> result)
{
cudaStream_t stream = raft::resource::get_cuda_stream(handle);
auto host_keys = raft::make_host_vector<T1, int64_t>(handle, keys.size());

raft::copy(host_keys.data_handle(), keys.data_handle(), keys.size(), stream);
raft::linalg::map(handle, result, raft::cast_op<T2>{}, raft::make_const_mdspan(origin));
auto origin_map = raft::make_device_vector<T2, int64_t>(handle, host_keys(host_keys.size()) + 1);

thrust::scatter(raft::resource::get_thrust_policy(handle),
counts.data_handle(),
counts.data_handle() + counts.size(),
keys.data_handle(),
origin_map.data_handle());

thrust::for_each(raft::resource::get_thrust_policy(handle),
result.data_handle(),
result.data_handle() + result.size(),
mapper<T2>(raft::make_const_mdspan(origin_map.view())));
}

template <typename T1, typename T2>
std::tuple<int, int> sparse_search_preprocess(raft::resources& handle,
raft::device_vector_view<T1, int64_t> rows,
raft::device_vector_view<T1, int64_t> columns,
raft::device_vector_view<T2, int64_t> values,
raft::device_vector_view<T2, int64_t> doc_lengths,
raft::device_vector_view<T2, int64_t> term_counts)
{
cudaStream_t stream = raft::resource::get_cuda_stream(handle);

auto num_rows =
raft::sparse::neighbors::get_n_components(rows.data_handle(), rows.size(), stream);

auto row_keys = raft::make_device_vector<int, int64_t>(handle, num_rows);
auto row_counts = raft::make_device_vector<float, int64_t>(handle, num_rows);
auto row_fill = raft::make_device_vector<float, int64_t>(handle, rows.size());

// the amount of columns(documents) that each row(term) is found in
thrust::fill(raft::resource::get_thrust_policy(handle),
row_fill.data_handle(),
row_fill.data_handle() + row_fill.size(),
1.0f);
get_uniques_counts(
handle, rows, columns, values, row_fill.view(), row_keys.view(), row_counts.view());

create_mapped_vector<int, float>(handle, rows, row_keys.view(), row_counts.view(), term_counts);
auto num_cols =
raft::sparse::neighbors::get_n_components(columns.data_handle(), columns.size(), stream);
auto col_keys = raft::make_device_vector<int, int64_t>(handle, num_cols);
auto col_counts = raft::make_device_vector<float, int64_t>(handle, num_cols);

get_uniques_counts(handle, columns, rows, values, values, col_keys.view(), col_counts.view());

int total_document_lengths = thrust::reduce(raft::resource::get_thrust_policy(handle),
col_counts.data_handle(),
col_counts.data_handle() + col_counts.size());
float avg_doc_length = float(total_document_lengths) / col_keys.size();

create_mapped_vector<int, float>(
handle, columns, col_keys.view(), col_counts.view(), doc_lengths);
return {col_keys.size(), avg_doc_length};
}

template <typename T1, typename T2>
void encode_tfidf(raft::resources& handle,
raft::device_vector_view<T1, int64_t> rows,
raft::device_vector_view<T1, int64_t> columns,
raft::device_vector_view<T2, int64_t> values,
raft::device_vector_view<T2, int64_t> values_out)
{
auto doc_lengths = raft::make_device_vector<float, int64_t>(handle, columns.size());
auto term_counts = raft::make_device_vector<float, int64_t>(handle, rows.size());
auto [doc_count, avg_doc_length] = sparse_search_preprocess<int, float>(
handle, rows, columns, values, doc_lengths.view(), term_counts.view());

raft::linalg::map(handle,
values_out,
tfidf(doc_count),
raft::make_const_mdspan(values),
raft::make_const_mdspan(term_counts.view()));
}

template <typename T1, typename T2>
void encode_bm25(raft::resources& handle,
jperez999 (Author): It might be better to create a single encode function and pass the desired struct via a function parameter that is relayed to the map call.

raft::device_vector_view<T1, int64_t> rows,
raft::device_vector_view<T1, int64_t> columns,
raft::device_vector_view<T2, int64_t> values,
raft::device_vector_view<T2, int64_t> values_out,
float k_param = 1.6f,
float b_param = 0.75)
{
auto doc_lengths = raft::make_device_vector<float, int64_t>(handle, columns.size());
auto term_counts = raft::make_device_vector<float, int64_t>(handle, rows.size());
auto [doc_count, avg_doc_length] = sparse_search_preprocess<int, float>(
handle, rows, columns, values, doc_lengths.view(), term_counts.view());

raft::linalg::map(handle,
values_out,
bm25(doc_count, avg_doc_length, k_param, b_param),
raft::make_const_mdspan(values),
raft::make_const_mdspan(doc_lengths.view()),
raft::make_const_mdspan(term_counts.view()));
}
2 changes: 0 additions & 2 deletions docs/source/contributing.md
@@ -89,5 +89,3 @@ implementation of the issue, ask them in the issue instead of the PR.

## Attribution
Portions adopted from https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md