Add spectral embedding algorithm #2875

Draft
wants to merge 12 commits into
base: main

avolkov-intel
Contributor

Description

Add an implementation of the spectral embedding algorithm in DAAL, together with oneDAL interfaces for it.

Contributor

@Vika-F left a comment

Not bad for the initial version.
Please decompose the compute function into smaller kernels. I'd move the binary search part and the Laplacian part into separate functions.
My other comments are below.

template <typename algorithmFPType, CpuType cpu>
services::Status computeEigenvectorsInplace(size_t nFeatures, algorithmFPType * eigenvectors, algorithmFPType * eigenvalues)
Contributor

Please add a description of the arguments.

Comment on lines 51 to 52
size_t numEmb = 1;
size_t numNeighbors = 1;
Contributor

In DAAL we try to use full names where possible.

Suggested change
- size_t numEmb = 1;
- size_t numNeighbors = 1;
+ size_t numberOfEmbeddings = 1;
+ size_t numberOfNeighbors = 1;

Comment on lines 43 to 44
std::int64_t embedding_dim = 0;
std::int64_t num_neighbors = -1;
Contributor

Let's align with scikit-learn and use components for the dimension of the projected space. Also, in oneDAL we usually use count, not dim or num, in naming.
In DAAL it would be numberOfComponents, etc.

Suggested change
- std::int64_t embedding_dim = 0;
- std::int64_t num_neighbors = -1;
+ std::int64_t component_count = 0;
+ std::int64_t neighbor_count = -1;

Comment on lines 74 to 72
void check_compute_result(const spectral_embedding::compute_result<>& result) {
array<Float> data_arr = row_accessor<const Float>(data_).pull({ 0, -1 });
}
Contributor

Please add comparison with 'golden' data computed with sklearn.
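
One possible shape for such a check, as a minimal sketch: eigenvectors are only defined up to sign, so each embedding column is compared against both the reference column and its negation. The helper name `columnsMatchUpToSign`, the row-major layout, and the tolerance argument are assumptions made for illustration; they are not part of the oneDAL test utilities.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a oneDAL API): compares two row-major n x k matrices
// column by column, accepting a column that matches +expected or -expected.
template <typename Float>
bool columnsMatchUpToSign(const std::vector<Float>& actual,
                          const std::vector<Float>& expected,
                          std::size_t n, std::size_t k, Float tol)
{
    assert(actual.size() == n * k && expected.size() == n * k);
    for (std::size_t c = 0; c < k; ++c)
    {
        bool samePlus  = true; // column equals the reference column
        bool sameMinus = true; // column equals the negated reference column
        for (std::size_t r = 0; r < n; ++r)
        {
            const Float a = actual[r * k + c];
            const Float e = expected[r * k + c];
            if (std::abs(a - e) > tol) samePlus = false;
            if (std::abs(a + e) > tol) sameMinus = false;
        }
        if (!samePlus && !sameMinus) return false;
    }
    return true;
}
```

The reference matrix would come from a separate sklearn run on the same input; the values themselves are not reproduced here.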

auto desc =
get_descriptor(sp_emb::result_options::embedding);
//desc.set_embedding_dim(5);
//desc.set_num_neighbors(4);
Contributor

Let's compute the number of neighbors by the formula provided in the original request.

@avolkov-intel
Contributor Author

/intelci: run

Contributor

@icfaust left a comment

Some out-of-the-blue questions about the algorithm implementation. None of this is mandatory; I just want to start a conversation with you and the other reviewers.

// Use binary search to find such d that the number of verticies having distance <= d is filtNum
const size_t binarySearchIterNum = 20;
// TODO: add parallel_for
for (size_t i = 0; i < n; ++i)
Contributor

I would split off the binary search (i.e. everything in this for loop) into a separate function which could be inlined. Then applying daal::threader_for would be easier (see this as an example: https://github.com/oneapi-src/oneDAL/blob/main/cpp/daal/src/algorithms/dtrees/forest/df_train_dense_default_impl.i#L434).
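
A minimal sketch of what such a split could look like, in plain C++ without any DAAL types. The function name `findDistanceThreshold`, its parameters, and the default of 20 iterations (taken from `binarySearchIterNum` in the snippet above) are assumptions for illustration; the actual loop body in the PR is not shown in full here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bisects on the distance value to find a threshold d in [lo, hi]
// such that at least `target` entries of `row` are <= d.
// Assumes hi is an upper bound for every entry of the row.
template <typename FPType>
inline FPType findDistanceThreshold(const FPType* row, std::size_t n, std::size_t target,
                                    FPType lo, FPType hi, std::size_t iterations = 20)
{
    assert(target <= n);
    assert(lo <= hi);
    for (std::size_t it = 0; it < iterations; ++it)
    {
        const FPType mid = (lo + hi) / 2;
        // Count the elements in the row with value <= mid
        std::size_t count = 0;
        for (std::size_t j = 0; j < n; ++j)
        {
            if (row[j] <= mid) ++count;
        }
        if (count >= target)
            hi = mid; // enough neighbors: tighten the upper bound
        else
            lo = mid; // too few neighbors: raise the lower bound
    }
    return hi; // hi always keeps at least `target` entries at or below it
}
```

Because the helper only reads one row and returns a value, each row can be processed independently, which is what makes a per-row parallel loop straightforward.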

x[i * n + i] = 0;
}

// Create Laplassian matrix
Contributor

Suggested change
- // Create Laplassian matrix
+ // Create Laplacian matrix

// std::cout << std::endl;
// }

// Fill the output matrix with eigen vectors corresponding to the smallest eigen values
Contributor

Suggested change
- // Fill the output matrix with eigen vectors corresponding to the smallest eigen values
+ // Fill the output matrix with eigenvectors corresponding to the smallest eigenvalues

DAAL_CHECK_BLOCK_STATUS(embedMatrix);
algorithmFPType * embed = embedMatrix.get();

for (int i = 0; i < k; ++i)
Contributor

Sorry if this is a dumb request: could you leave a comment in the code above this double for loop explaining what the matrix operation is doing? I know it's related to the eigenvectors out of X, but why are they written into the columns of embed? It may save some time in the future for someone unfamiliar with the code when they try to get up to speed. As far as I can see, it is a transposed copy of part of a row of x into a column of embed; is that right?
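
For reference, the operation as read in this comment can be sketched as below. It assumes that the eigenvectors are stored as rows of a row-major n x n matrix `x` and that `embed` is a row-major n x k matrix; both are the reviewer's reading rather than something confirmed by the PR, and the function name `copyEigenvectorsToColumns` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: copies the first k rows of x (assumed to hold eigenvectors,
// one per row) into the columns of embed, i.e. embed = transpose(x[0:k, :]).
template <typename FPType>
void copyEigenvectorsToColumns(const FPType* x, FPType* embed, std::size_t n, std::size_t k)
{
    assert(k <= n);
    for (std::size_t i = 0; i < k; ++i) // i-th eigenvector (row of x)
    {
        for (std::size_t j = 0; j < n; ++j) // j-th sample (row of embed)
        {
            embed[j * k + i] = x[i * n + j];
        }
    }
}
```

With this layout, row j of `embed` holds the k embedding coordinates of sample j.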

{
M = (L + R) / 2;
cnt = 0;
// Calculate the number of elements in the row with value <= M
Contributor

Just from a high-level perspective, it looks like this code will take O(binarySearchIterNum * n) to find the proper M that yields the right number of neighbors. binarySearchIterNum will be log(1/eps) (from here: https://github.com/oneapi-src/oneDAL/pull/2875/files#r1720389061). Would a binary search tree be faster? At some point, wouldn't just sorting the array also be faster (when log(n) < log(1/eps))?
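
As a point of comparison, a selection-based alternative can be sketched with `std::nth_element`, which finds the k-th smallest value in O(n) on average without a full sort. This is a different technique from the bisection in the PR, shown only to make the trade-off concrete; the function name `kthSmallestDistance` is hypothetical, and the sketch copies the row because `std::nth_element` reorders its input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alternative: returns the k-th smallest value of the row (1-based k).
// Every entry <= the returned value is then one of the k nearest neighbors (ties aside).
template <typename FPType>
FPType kthSmallestDistance(const FPType* row, std::size_t n, std::size_t k)
{
    assert(k >= 1 && k <= n);
    std::vector<FPType> scratch(row, row + n); // copy: nth_element permutes its input
    std::nth_element(scratch.begin(), scratch.begin() + (k - 1), scratch.end());
    return scratch[k - 1];
}
```

The scratch copy costs O(n) extra memory per row, which may matter when rows are processed in parallel.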

}

// Create Laplassian matrix
for (size_t i = 0; i < n; ++i)
Contributor

Dumb question: I sort of assumed the nearest cosine neighbors from above created almost an adjacency matrix, which could be used here to create the Laplacian via L = D - A. I guess I am trying to understand the math behind its creation here; could you send me a link to the definition you used?
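
For concreteness, the unnormalized definition L = D - A mentioned here can be sketched as follows, where D is the diagonal matrix of row sums of A. Whether the PR uses this definition or a normalized variant is exactly the open question, so this is only an illustration of the formula; the function name `buildLaplacian` and the `std::vector` storage are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the unnormalized graph Laplacian L = D - A for a
// row-major n x n adjacency matrix a, where D[i][i] is the sum of row i of a.
template <typename FPType>
std::vector<FPType> buildLaplacian(const std::vector<FPType>& a, std::size_t n)
{
    assert(a.size() == n * n);
    std::vector<FPType> l(n * n);
    for (std::size_t i = 0; i < n; ++i)
    {
        FPType degree = FPType(0);
        for (std::size_t j = 0; j < n; ++j) degree += a[i * n + j];
        for (std::size_t j = 0; j < n; ++j)
        {
            const FPType d = (i == j) ? degree : FPType(0);
            l[i * n + j] = d - a[i * n + j];
        }
    }
    return l;
}
```

For a symmetric A, every row of the result sums to zero, which is a quick sanity check on any implementation of this definition.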

DAAL_CHECK_BLOCK_STATUS(embedMatrix);
algorithmFPType * embed = embedMatrix.get();

for (int i = 0; i < k; ++i)
Contributor

Maybe also add a TODO for a parallel for here?


size_t lcnt, rcnt, cnt;
algorithmFPType L, R, M;
// Use binary search to find such d that the number of verticies having distance <= d is filtNum
Contributor

Suggested change
- // Use binary search to find such d that the number of verticies having distance <= d is filtNum
+ // Use binary search to find such d that the number of vertices having distance <= d is filtNum

break;
}
}
// create edges for the closest neighbors
Contributor

So I guess, now that I think about it, this is heading in the direction of this sklearn function: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.kneighbors_graph.html Any thoughts on using kneighbors from DAAL for this, or am I missing something big?

status |= computeEigenvectorsInplace<algorithmFPType, cpu>(n, x, eigenValuesPtr);
DAAL_CHECK_STATUS_VAR(status);

// std::cout << "Eigen vectors: " << std::endl;
Contributor

Just a reminder to delete unused code.

{
algorithmFPType val = (x[i * n + j] + x[j * n + i]) / 2;
x[i * n + j] = -val;
x[j * n + i] = -val;
Contributor

I'm a little bit confused by the inconsistency: in this case you are assigning values to the elements of the matrix, but in the other part you are incrementing the symmetric ones. Does this matrix represent two triangular ones, or should the operations actually be symmetric?

Contributor

I am/was too. Hopefully a code comment can add a bit of clarity.

DAAL_CHECK_BLOCK_STATUS(embedMatrix);
algorithmFPType * embed = embedMatrix.get();

for (int i = 0; i < k; ++i)
Contributor

Should it be size_t instead?

constexpr std::int64_t neighbor_count = 5;
constexpr std::int64_t component_count = 4;

constexpr Float data[n * p] = { 0.49671415, -0.1382643, 0.64768854, 1.52302986,
Contributor

It seems this data is auto-generated. Please consider using an RNG to generate it.
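
A minimal sketch of seeded generation using only the C++ standard library. The function name `generateNormalData` is hypothetical, and oneDAL's own test utilities may offer a preferred way to do this. Note that `std::normal_distribution` output is implementation-defined, so the values will differ from NumPy's and may differ between standard libraries; any reference results would have to be computed from the same generated values.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fills a row-major n x p matrix with standard normal samples
// from a fixed seed, so repeated calls with the same seed give the same data
// within one standard library implementation.
template <typename Float>
std::vector<Float> generateNormalData(std::size_t n, std::size_t p, unsigned seed)
{
    assert(n > 0 && p > 0);
    std::mt19937 engine(seed);
    std::normal_distribution<Float> dist(Float(0), Float(1));
    std::vector<Float> data(n * p);
    for (auto& value : data)
    {
        value = dist(engine);
    }
    return data;
}
```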

{
if (x[i * n + j] <= R)
{
x[i * n + j] = 1.0;
Contributor

Suggested change
- x[i * n + j] = 1.0;
+ x[i * n + j] = algorithmFPType(1);

}
else
{
x[i * n + j] = 0.0;
Contributor

Suggested change
- x[i * n + j] = 0.0;
+ x[i * n + j] = algorithmFPType(0);

WriteRows<algorithmFPType, cpu> xMatrix(covOutput, 0, n);
DAAL_CHECK_BLOCK_STATUS(xMatrix);
algorithmFPType * x = xMatrix.get();

Contributor

I believe that, according to SDL, there should be several DAAL_ASSERT_... statements somewhere around here to check that your memory accesses stay within bounds.

Comment on lines +97 to +98
size_t lcnt, rcnt, cnt;
algorithmFPType L, R, M;
Contributor

I would recommend initializing these variables. This is not C code =D

Comment on lines +104 to +105
L = 0; // min possible cos distance
R = 2; // max possible cos distance
Contributor

As far as I remember, at least some of these variables are actually floats. Should we initialize them accordingly?


struct KernelParameter : daal::algorithms::Parameter
{
size_t numberOfEmbeddings = 1;
Contributor

Should it be DAAL_INT? Personally I have nothing against using size_t; I just care about consistency.

#include "oneapi/dal/algo/spectral_embedding.hpp"
#include "oneapi/dal/io/csv.hpp"
#include <algorithm>
#include <math.h>
Contributor

Suggested change
- #include <math.h>
+ #include <cmath>

Please use regular C++ headers.

4 participants