
mutualinfo with KSG1/KSG2's results different from scikit-learn and about 2x slower for large arrays #367

Open
liuyxpp opened this issue May 23, 2024 · 3 comments


@liuyxpp

liuyxpp commented May 23, 2024

In Python + scikit-learn

from sklearn import feature_selection
import numpy as np

X = np.arange(10)
X = X / np.pi
X = X.reshape((10, 1))
y = np.array([0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 1.0, 0.9])
# By default: n_neighbors = 3
feature_selection.mutual_info_regression(X, y)   # = 0.69563492

In Julia + CausalityTools

import CausalityTools: mutualinfo, KSG1, KSG2

X = (0:9) ./ π
y = [0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 1.0, 0.9]
mutualinfo(KSG1(k=3), X, y)  # = -0.43223601423458935
mutualinfo(KSG2(k=3), X, y)  # = 0.4746008686099013

The results are clearly different. In addition, scikit-learn reports 0 whenever the estimate is negative (mi < 0).
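
For a like-for-like comparison, the same convention can be applied on the Julia side; a minimal sketch (the clipping is the only addition over the call above):

mi = mutualinfo(KSG1(k=3), X, y)
max(zero(mi), mi)  # clip negative estimates to 0, as scikit-learn does internally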

For running time, I have

In Python

import time

X = np.random.rand(171522, 1)
y = np.random.rand(171522,)
tic = time.perf_counter(); feature_selection.mutual_info_regression(X, y); toc = time.perf_counter()
# output: array([0.00072401])
toc - tic  # = 0.96151989325881 seconds

In Julia

mutualinfo(KSG1(k=3), rand(171522), rand(171522))
@time mutualinfo(KSG1(k=3), rand(171522), rand(171522))  # 2.245426 seconds (857.67 k allocations: 83.230 MiB)

mutualinfo(KSG2(k=3), rand(171522), rand(171522))
@time mutualinfo(KSG2(k=3), rand(171522), rand(171522))  # 2.113910 seconds (857.68 k allocations: 91.082 MiB)

So the Julia implementation is roughly twice as slow as the Python one.

@kahaaga
Member

kahaaga commented May 23, 2024

Hey, @liuyxpp! Thanks for the report.

Can you report back the output of Pkg.status() from the same session you're testing in, so I know which package versions you're using?

  • Using the @time macro in Julia isn't the optimal way of measuring real-world performance for snippets of code. To verify that the code is actually slower than the Python implementation, can you please use the @btime macro from BenchmarkTools.jl and check again? Make sure to run your code once before timing, so that the first-call compilation cost is excluded. Like so:
using Random; rng = MersenneTwister(1234)
using CausalityTools, BenchmarkTools
x = rand(rng, 10000)
y = rand(rng, 10000)

mutualinfo(KSG1(), x, y) # run once for precompilation
@btime mutualinfo(KSG1(), $x, $y) # now time it

If the performance gap is still evident after testing properly with @btime, then the difference might be due to implementation details.

  • The KSG1/KSG2 implementations here follow strictly what is stated in the original paper, which is linked in the documentation. From what I can tell, the Python implementation is based on both the original paper and Ross's paper, which I am not familiar with. Perhaps they implement some performance enhancements? Have you had a look at that paper?
  • We use NearestNeighbors.jl's methods for the nearest neighbor searches in our implementation. Do you know which nearest-neighbor search routine the Python method uses? For reference, a sketch of the kind of search the estimator performs follows below.
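
To make the comparison concrete, here is a minimal sketch of a KSG1-style estimate built directly on NearestNeighbors.jl. This illustrates the algorithm from the Kraskov et al. paper, not CausalityTools' actual implementation; the function name ksg1_sketch and all implementation details here are hypothetical:

using NearestNeighbors, Distances, SpecialFunctions, Statistics

# Hypothetical illustration of the KSG1 estimator (Kraskov et al.);
# NOT CausalityTools' actual code. Result is in nats.
function ksg1_sketch(x::AbstractVector, y::AbstractVector; k = 3)
    n = length(x)
    z = vcat(x', y')                          # 2×n points in the joint space
    joint_tree = KDTree(z, Chebyshev())
    # k + 1 neighbors, because each query point is returned as its own neighbor
    _, dists = knn(joint_tree, z, k + 1, true)
    ϵ = [d[end] for d in dists]               # distance to the k-th real neighbor
    tx = KDTree(reshape(collect(x), 1, n), Chebyshev())
    ty = KDTree(reshape(collect(y), 1, n), Chebyshev())
    # count marginal neighbors strictly closer than ϵᵢ, excluding the point itself
    nx = [length(inrange(tx, [x[i]], prevfloat(ϵ[i]))) - 1 for i in 1:n]
    ny = [length(inrange(ty, [y[i]], prevfloat(ϵ[i]))) - 1 for i in 1:n]
    return digamma(k) + digamma(n) - mean(digamma.(nx .+ 1) .+ digamma.(ny .+ 1))
end

If scikit-learn's tree construction or query routine differs (for example, the specialized 1D handling described by Ross), that alone could account for much of the gap.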

@kahaaga
Member

kahaaga commented May 23, 2024

@liuyxpp Also: are you also running some sort of multithreading for the Python code?
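
For reference, the number of Julia threads in the current session can be checked with the one-liner below; comparing it against whatever the Python side uses would make the timings easier to interpret:

Threads.nthreads()  # number of Julia threads in this session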

@kahaaga
Member

kahaaga commented May 24, 2024

It would also be nice to see whether this apparent performance difference also holds for higher-dimensional data. In the Ross paper, which is linked in the Python docs, they use a specialized fast implementation for 1D data. Could you try it out with two large 3D datasets, for example? Something along the lines of the sketch below.
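
A minimal sketch, assuming mutualinfo accepts multivariate StateSpaceSet inputs (the dataset size is just a placeholder):

using CausalityTools, BenchmarkTools

x3 = StateSpaceSet(rand(171522, 3))  # assumption: rows are points in 3D
y3 = StateSpaceSet(rand(171522, 3))

mutualinfo(KSG1(k = 3), x3, y3)           # run once to compile
@btime mutualinfo(KSG1(k = 3), $x3, $y3)  # then benchmark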
