Add spectral embedding algorithm #2875

avolkov-intel · 2024-08-16T09:24:51Z

Description

Add an implementation of the spectral embedding algorithm in DAAL, together with oneDAL interfaces for it.

Vika-F

Not bad for the initial version.
Please decompose the compute function into smaller kernels. I'd move the binary search part and the Laplacian part into separate functions.
My other comments are below.

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_fpt_cpu.cpp

Vika-F · 2024-08-16T09:32:17Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+template <typename algorithmFPType, CpuType cpu>
+services::Status computeEigenvectorsInplace(size_t nFeatures, algorithmFPType * eigenvectors, algorithmFPType * eigenvalues)


Please add a description of the arguments.

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

Vika-F · 2024-08-16T10:13:15Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_kernel.h

+    size_t numEmb = 1;
+    size_t numNeighbors = 1;


In DAAL we try to use full names where possible.

Suggested change

-    size_t numEmb = 1;
-    size_t numNeighbors = 1;
+    size_t numberOfEmbeddings = 1;
+    size_t numberOfNeighbors = 1;

cpp/oneapi/dal/algo/spectral_embedding/backend/cpu/compute_kernel.cpp

Vika-F · 2024-08-16T10:21:57Z

cpp/oneapi/dal/algo/spectral_embedding/common.cpp

+    std::int64_t embedding_dim = 0;
+    std::int64_t num_neighbors = -1;


Let's align with scikit-learn and use "components" for the dimension of the projected space. Also, in oneDAL we usually use count, not dim or num, in naming.
In DAAL it would be numberOfComponents, etc.

Suggested change

-    std::int64_t embedding_dim = 0;
-    std::int64_t num_neighbors = -1;
+    std::int64_t component_count = 0;
+    std::int64_t neighbor_count = -1;

Vika-F · 2024-08-16T10:28:05Z

cpp/oneapi/dal/algo/spectral_embedding/test/fixture.hpp

+    void check_compute_result(const spectral_embedding::compute_result<>& result) {
+        array<Float> data_arr = row_accessor<const Float>(data_).pull({ 0, -1 });
+    }


Please add comparison with 'golden' data computed with sklearn.
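
One point to keep in mind for such a comparison: eigenvectors are only defined up to sign, so an element-wise check against sklearn output can fail even when the embedding is correct. The following is a minimal sketch of a sign-tolerant column comparison. The helper name `columnsMatchUpToSign`, the row-major layout, and the tolerance are assumptions for illustration and are not part of the PR or of the oneDAL test fixtures.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): returns true if every column of
// `actual` equals the same column of `golden` up to a global sign flip,
// within the absolute tolerance `tol`. Both matrices are row-major,
// rowCount x columnCount.
template <typename FPType>
bool columnsMatchUpToSign(const std::vector<FPType> & actual, const std::vector<FPType> & golden,
                          std::size_t rowCount, std::size_t columnCount, FPType tol)
{
    assert(actual.size() == rowCount * columnCount);
    assert(golden.size() == actual.size());
    for (std::size_t c = 0; c < columnCount; ++c)
    {
        FPType errSame    = 0; // max |a - g| over the column
        FPType errFlipped = 0; // max |a + g| over the column
        for (std::size_t r = 0; r < rowCount; ++r)
        {
            const FPType a = actual[r * columnCount + c];
            const FPType g = golden[r * columnCount + c];
            errSame        = std::max(errSame, std::abs(a - g));
            errFlipped     = std::max(errFlipped, std::abs(a + g));
        }
        if (std::min(errSame, errFlipped) > tol) return false;
    }
    return true;
}
```

This only handles the sign ambiguity. If the Laplacian has repeated eigenvalues, the corresponding eigenvectors can differ by more than a sign, and a subspace-level comparison would be needed instead.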

Vika-F · 2024-08-16T10:30:09Z

cpp/oneapi/dal/algo/spectral_embedding/test/fixture.hpp

+        auto desc =
+            get_descriptor(sp_emb::result_options::embedding);
+        //desc.set_embedding_dim(5);
+        //desc.set_num_neighbors(4);


Let's compute the number of neighbors by the formula provided in the original request.

avolkov-intel · 2024-08-21T16:16:23Z

/intelci: run

avolkov-intel · 2024-08-21T21:59:10Z

/intelci: run

avolkov-intel · 2024-08-22T10:35:57Z

/intelci: run

avolkov-intel · 2024-08-22T17:58:04Z

/intelci: run

avolkov-intel · 2024-08-26T13:00:26Z

/intelci: run

avolkov-intel · 2024-08-29T15:55:02Z

/intelci: run

avolkov-intel · 2024-08-29T19:05:53Z

/intelci: run

icfaust

Some out-of-the-blue questions about the algorithm implementation. Nothing here needs to be enforced; I just want to start a conversation with you and the other reviewers.

icfaust · 2024-09-02T07:38:17Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    // Use binary search to find such d that the number of verticies having distance <= d is filtNum
+    const size_t binarySearchIterNum = 20;
+    // TODO: add parallel_for
+    for (size_t i = 0; i < n; ++i)


I would split the binary search (i.e. everything in this for loop) off into a separate function which could be inlined. Applying daal::threader_for would then be easier (see this as an example: https://github.com/oneapi-src/oneDAL/blob/main/cpp/daal/src/algorithms/dtrees/forest/df_train_dense_default_impl.i#L434)
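
As a rough illustration of that split, the per-row search could look like the sketch below. It is written against plain C++ rather than DAAL types; the function name `findDistanceThreshold`, its parameters, and the choice to return the upper bound are assumptions made for this example, not the PR's actual code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): bisects on the distance value to find
// a threshold d such that `targetCount` entries of `row` satisfy row[j] <= d.
// Assumes every value lies in [lo, hi]; for cosine distance that is [0, 2].
template <typename FPType>
FPType findDistanceThreshold(const FPType * row, std::size_t n, std::size_t targetCount,
                             FPType lo = FPType(0), FPType hi = FPType(2), std::size_t maxIter = 20)
{
    assert(row != nullptr && targetCount <= n);
    for (std::size_t iter = 0; iter < maxIter; ++iter)
    {
        const FPType mid  = (lo + hi) / FPType(2);
        std::size_t count = 0;
        // Count the entries of the row with value <= mid
        for (std::size_t j = 0; j < n; ++j)
        {
            if (row[j] <= mid) ++count;
        }
        if (count == targetCount) return mid;
        if (count < targetCount)
            lo = mid;
        else
            hi = mid;
    }
    // No exact match within maxIter: hi still covers at least targetCount entries
    return hi;
}
```

Because the function reads one row and has no shared mutable state, each row can be processed independently, which is what makes the loop over `i` a natural candidate for a parallel loop.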

icfaust · 2024-09-02T07:40:24Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+        x[i * n + i] = 0;
+    }
+
+    // Create Laplassian matrix


Suggested change

-    // Create Laplassian matrix
+    // Create Laplacian matrix

icfaust · 2024-09-02T07:43:49Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    //     std::cout << std::endl;
+    // }
+
+    // Fill the output matrix with eigen vectors corresponding to the smallest eigen values


Suggested change

-    // Fill the output matrix with eigen vectors corresponding to the smallest eigen values
+    // Fill the output matrix with eigenvectors corresponding to the smallest eigenvalues

icfaust · 2024-09-02T07:48:58Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    DAAL_CHECK_BLOCK_STATUS(embedMatrix);
+    algorithmFPType * embed = embedMatrix.get();
+
+    for (int i = 0; i < k; ++i)


Sorry if this is a dumb request: could you leave a comment in the code above this double for loop explaining what the matrix operation is doing? I know it's related to the eigenvectors out of X, but why into the columns of embed? It may save some time in the future for someone unfamiliar when they try to get up to speed. I can see it's the transposed copy of part of a row of x into a column of embed, is that right?
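
For readers following along, the operation described in that last sentence can be sketched as below. This assumes that the first k eigenvectors are stored as rows of a row-major n x n buffer and that the output is a row-major n x k matrix; the storage layout and the name `rowsToColumns` are assumptions for illustration, since the PR's loop body is not quoted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): copies the first k rows of the
// row-major n x n matrix `x` into the columns of a row-major n x k matrix,
// i.e. embed(j, i) = x(i, j) for i < k.
template <typename FPType>
std::vector<FPType> rowsToColumns(const std::vector<FPType> & x, std::size_t n, std::size_t k)
{
    assert(k <= n && x.size() == n * n);
    std::vector<FPType> embed(n * k);
    for (std::size_t i = 0; i < k; ++i) // i-th eigenvector, stored as row i of x
    {
        for (std::size_t j = 0; j < n; ++j) // j-th sample
        {
            embed[j * k + i] = x[i * n + j];
        }
    }
    return embed;
}
```

Under these assumptions, each row of the result holds the k embedding coordinates of one sample, which is the usual shape for an embedding output.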

icfaust · 2024-09-02T08:00:04Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+        {
+            M   = (L + R) / 2;
+            cnt = 0;
+            // Calculate the number of elements in the row with value <= M


Just from a high-level perspective, it looks like this code will run in O(binarySearchIterNum*n) to find the proper M which yields the right number of neighbors. binarySearchIterNum will be log(1/eps) (from here https://github.com/oneapi-src/oneDAL/pull/2875/files#r1720389061). Would a binary search tree be faster? At some point, wouldn't just sorting the array also be faster (when log(n) < log(1/eps))?
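
A third option in the same spirit, besides a search tree or a full sort, is a selection algorithm: `std::nth_element` finds the k-th smallest value in linear time on average and gives an exact threshold without an iteration cap. The sketch below is an illustration only; `kthSmallest` is a hypothetical name, and the copy into a scratch buffer is an assumption made so the input row is left untouched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alternative (not from the PR): returns the k-th smallest value
// of `row` (k is 1-based). At least k entries of the row are <= the result.
template <typename FPType>
FPType kthSmallest(const FPType * row, std::size_t n, std::size_t k)
{
    assert(row != nullptr && k >= 1 && k <= n);
    // Work on a copy so the caller's row keeps its order
    std::vector<FPType> scratch(row, row + n);
    std::nth_element(scratch.begin(), scratch.begin() + (k - 1), scratch.end());
    return scratch[k - 1];
}
```

The trade-off is an extra O(n) scratch buffer per row (or per thread), whereas the bisection approach needs no additional memory.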

icfaust · 2024-09-02T08:13:17Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    }
+
+    // Create Laplassian matrix
+    for (size_t i = 0; i < n; ++i)


Dumb question: I sort of assumed the nearest cosine neighbors from above created almost an adjacency matrix, which could be used here to create the Laplacian via L=D-A. I guess I am trying to understand the construction math here; could you send me a link to the definition you used?
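
For reference, the textbook construction mentioned here can be sketched as follows: symmetrize the k-nearest-neighbor adjacency matrix by averaging, then form the unnormalized Laplacian L = D - W, where D is the diagonal degree matrix. This is an illustration of that definition only. Whether the PR uses this unnormalized form or a normalized variant is not visible in the quoted snippet, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only (not the PR's code): converts a row-major n x n adjacency
// matrix A in place into L = D - W, with W = (A + A^T) / 2 and zero diagonal,
// and D(i, i) = sum_j W(i, j).
template <typename FPType>
void adjacencyToLaplacianInplace(std::vector<FPType> & a, std::size_t n)
{
    assert(a.size() == n * n);
    // Symmetrize by averaging and store the negated weights off the diagonal
    for (std::size_t i = 0; i < n; ++i)
    {
        a[i * n + i] = FPType(0); // no self-loops
        for (std::size_t j = 0; j < i; ++j)
        {
            const FPType w = (a[i * n + j] + a[j * n + i]) / FPType(2);
            a[i * n + j]   = -w;
            a[j * n + i]   = -w;
        }
    }
    // Diagonal entry is the degree, i.e. minus the sum of the off-diagonal row entries
    for (std::size_t i = 0; i < n; ++i)
    {
        FPType offDiagonalSum = FPType(0);
        for (std::size_t j = 0; j < n; ++j)
        {
            if (j != i) offDiagonalSum += a[i * n + j];
        }
        a[i * n + i] = -offDiagonalSum;
    }
}
```

With this definition every row of L sums to zero, which is a quick sanity check for a test.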

icfaust · 2024-09-02T08:14:15Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    DAAL_CHECK_BLOCK_STATUS(embedMatrix);
+    algorithmFPType * embed = embedMatrix.get();
+
+    for (int i = 0; i < k; ++i)


Maybe also a todo for a parallel for here?

icfaust · 2024-09-02T08:15:00Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+
+    size_t lcnt, rcnt, cnt;
+    algorithmFPType L, R, M;
+    // Use binary search to find such d that the number of verticies having distance <= d is filtNum


Suggested change

-    // Use binary search to find such d that the number of verticies having distance <= d is filtNum
+    // Use binary search to find such d that the number of vertices having distance <= d is filtNum

icfaust · 2024-09-02T08:17:15Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+                break;
+            }
+        }
+        // create edges for the closest neighbors


So I guess now that I think about it this is getting in the direction of this sklearn function: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.kneighbors_graph.html Thoughts on using kneighbors from daal for this, or am I missing something big?

KulikovNikita · 2024-09-11T17:16:14Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    status |= computeEigenvectorsInplace<algorithmFPType, cpu>(n, x, eigenValuesPtr);
+    DAAL_CHECK_STATUS_VAR(status);
+
+    // std::cout << "Eigen vectors: " << std::endl;


Just a reminder to delete unused code

KulikovNikita · 2024-09-11T17:20:13Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+        {
+            algorithmFPType val = (x[i * n + j] + x[j * n + i]) / 2;
+            x[i * n + j]        = -val;
+            x[j * n + i]        = -val;


I'm a little bit confused by the inconsistency: in this case you are assigning values to the elements of the matrix, but in the other part you are incrementing the symmetric ones. Does this matrix represent two triangular ones, or should the operations actually be symmetric?

I am/was too. Hopefully a code comment can add a bit of clarity.

KulikovNikita · 2024-09-11T17:22:16Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    DAAL_CHECK_BLOCK_STATUS(embedMatrix);
+    algorithmFPType * embed = embedMatrix.get();
+
+    for (int i = 0; i < k; ++i)


Should it be size_t instead?

KulikovNikita · 2024-09-11T17:43:58Z

cpp/oneapi/dal/algo/spectral_embedding/test/fixture.hpp

+        constexpr std::int64_t neighbor_count = 5;
+        constexpr std::int64_t component_count = 4;
+
+        constexpr Float data[n * p] = { 0.49671415,  -0.1382643,  0.64768854,  1.52302986,


It seems this data is auto-generated. Please consider using rng to generate it.
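
A minimal sketch of seeded generation with the standard library is shown below. The function name and parameters are hypothetical, and the oneDAL test infrastructure may already offer its own data-generation helpers that would be preferable in the actual fixture.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not from the PR): fills a row-major
// rowCount x columnCount matrix with standard normal values from a seeded engine.
template <typename FPType>
std::vector<FPType> generateNormalData(std::size_t rowCount, std::size_t columnCount, unsigned seed)
{
    assert(rowCount > 0 && columnCount > 0);
    std::mt19937 engine(seed);
    std::normal_distribution<FPType> dist(FPType(0), FPType(1));
    std::vector<FPType> data(rowCount * columnCount);
    for (auto & value : data)
    {
        value = dist(engine);
    }
    return data;
}
```

One caveat: the values produced by `std::normal_distribution` are implementation-defined, so they can differ between standard libraries. If the test also compares against precomputed reference values, a hardcoded input table may still be needed for that part.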

KulikovNikita · 2024-09-15T11:04:56Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+        {
+            if (x[i * n + j] <= R)
+            {
+                x[i * n + j] = 1.0;


Suggested change

-                x[i * n + j] = 1.0;
+                x[i * n + j] = algorithmFPType(1);

KulikovNikita · 2024-09-15T11:05:19Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+            }
+            else
+            {
+                x[i * n + j] = 0.0;


Suggested change

-                x[i * n + j] = 0.0;
+                x[i * n + j] = algorithmFPType(0);

KulikovNikita · 2024-09-15T11:06:58Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    WriteRows<algorithmFPType, cpu> xMatrix(covOutput, 0, n);
+    DAAL_CHECK_BLOCK_STATUS(xMatrix);
+    algorithmFPType * x = xMatrix.get();
+


I believe that, according to SDL, there should be several DAAL_ASSERT_... statements somewhere around here to check that your memory accesses stay in bounds.

KulikovNikita · 2024-09-15T11:07:54Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+    size_t lcnt, rcnt, cnt;
+    algorithmFPType L, R, M;


I would recommend initializing these variables. This isn't C code =D

KulikovNikita · 2024-09-15T11:08:57Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i

+        L    = 0; // min possible cos distance
+        R    = 2; // max possible cos distance


As far as I remember at least some of these variables are actually floats. Should we initialize them accordingly?

KulikovNikita · 2024-09-15T11:11:03Z

cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_kernel.h

+
+struct KernelParameter : daal::algorithms::Parameter
+{
+    size_t numberOfEmbeddings = 1;


Should it be DAAL_INT? Personally I have nothing against using size_t; I just care about consistency.

KulikovNikita · 2024-09-15T11:13:02Z

examples/oneapi/cpp/source/spectral_clustering/spectral_clustering_pipeline.cpp

+#include "oneapi/dal/algo/spectral_embedding.hpp"
+#include "oneapi/dal/io/csv.hpp"
+#include <algorithm>
+#include <math.h>


Suggested change

-#include <math.h>
+#include <cmath>

Please use regular C++ headers.

Vika-F reviewed Aug 16, 2024


avolkov-intel added 9 commits August 22, 2024 10:56

Initial commit

7afd556

Update ouput format

dfe990c

Add comments

34d930f

Add dummy dpc backend and run clang-format

e67cb3e

Add spectral clustering example

b93530e

Minor

a208c80

Update makefile.lst

32b9425

Add DAAL_EXPORT, update copyrights, address comments

8f0e130

Update build file

c3318e9

avolkov-intel force-pushed the dev/spectral-embeddings branch from b0c3428 to c3318e9 Compare August 22, 2024 17:56

Update BUILD files

4eb85bc

avolkov-intel added 2 commits August 28, 2024 06:54

Add eigen_values result option, update naming

aaeada8

Add test

a48b92e

icfaust reviewed Sep 2, 2024


KulikovNikita reviewed Sep 11, 2024


KulikovNikita reviewed Sep 15, 2024
