
mutualinfo with KSG1/KSG2's results different from scikit-learn and about 2x slower for large arrays #367

Open
liuyxpp opened this issue May 23, 2024 · 3 comments


@liuyxpp

liuyxpp commented May 23, 2024

In Python + scikit-learn

from sklearn import feature_selection
import numpy as np

X = np.arange(10)
X = X / np.pi
X = X.reshape((10, 1))
y = np.array([0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 1.0, 0.9])
# By default: n_neighbors = 3
feature_selection.mutual_info_regression(X, y)   # = 0.69563492

In Julia + CausalityTools

import CausalityTools: mutualinfo, KSG1, KSG2

X = (0:9) ./ π
y = [0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 1.0, 0.9]
mutualinfo(KSG1(k=3), X, y)  # = -0.43223601423458935
mutualinfo(KSG2(k=3), X, y)  # = 0.4746008686099013

The results are clearly different. In addition, scikit-learn reports 0 whenever the estimate is negative (mi < 0).
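
For a like-for-like comparison, the same convention can be applied on the Julia side; a minimal sketch (the clipping is the only addition over the call above):

mi = mutualinfo(KSG1(k=3), X, y)
max(zero(mi), mi)  # clip negative estimates to 0, as scikit-learn does internally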

For running time, I have

In Python

import time

X = np.random.rand(171522, 1)
y = np.random.rand(171522,)
tic = time.perf_counter(); feature_selection.mutual_info_regression(X, y); toc = time.perf_counter()
# output: array([0.00072401])
toc - tic  # = 0.96151989325881 seconds

In Julia

mutualinfo(KSG1(k=3), rand(171522), rand(171522))
@time mutualinfo(KSG1(k=3), rand(171522), rand(171522))  # 2.245426 seconds (857.67 k allocations: 83.230 MiB)

mutualinfo(KSG2(k=3), rand(171522), rand(171522))
@time mutualinfo(KSG2(k=3), rand(171522), rand(171522))  # 2.113910 seconds (857.68 k allocations: 91.082 MiB)

So the Julia implementation is roughly twice as slow as the Python one.

@kahaaga
Member

kahaaga commented May 23, 2024

Hey, @liuyxpp! Thanks for the report.

Can you report back the output of Pkg.status() from the same session you're testing in, so I know which package versions you're using?

  • Using the @time macro in Julia isn't the optimal way of measuring real-world performance for snippets of code. To verify that the code is actually slower than the Python implementation, can you please use the @btime macro from BenchmarkTools.jl and check again? Make sure to run your code once before timing, so that the first-call compilation cost is excluded. Like so:
using Random; rng = MersenneTwister(1234)
using CausalityTools, BenchmarkTools
x = rand(rng, 10000)
y = rand(rng, 10000)

mutualinfo(KSG1(), x, y) # run once for precompilation
@btime mutualinfo(KSG1(), $x, $y) # now time it

If the performance gap is still evident after testing properly with @btime, then the difference might be due to implementation details.

  • The KSG1/KSG2 implementations here follow strictly what is stated in the original paper, which is linked in the documentation. From what I can tell, the Python implementation is based on both the original paper and Ross's paper, which I am not familiar with. Perhaps they implement some performance enhancements? Have you had a look at that paper?
  • We use NearestNeighbors.jl's methods for the nearest neighbor searches in our implementation. Do you know which nearest-neighbor search routine the Python method uses? For reference, a sketch of the kind of search the estimator performs follows below.
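
To make the comparison concrete, here is a minimal sketch of a KSG1-style estimate built directly on NearestNeighbors.jl. This illustrates the algorithm from the Kraskov et al. paper, not CausalityTools' actual implementation; the function name ksg1_sketch and all implementation details here are hypothetical:

using NearestNeighbors, Distances, SpecialFunctions, Statistics

# Hypothetical illustration of the KSG1 estimator (Kraskov et al.);
# NOT CausalityTools' actual code. Result is in nats.
function ksg1_sketch(x::AbstractVector, y::AbstractVector; k = 3)
    n = length(x)
    z = vcat(x', y')                          # 2×n points in the joint space
    joint_tree = KDTree(z, Chebyshev())
    # k + 1 neighbors, because each query point is returned as its own neighbor
    _, dists = knn(joint_tree, z, k + 1, true)
    ϵ = [d[end] for d in dists]               # distance to the k-th real neighbor
    tx = KDTree(reshape(collect(x), 1, n), Chebyshev())
    ty = KDTree(reshape(collect(y), 1, n), Chebyshev())
    # count marginal neighbors strictly closer than ϵᵢ, excluding the point itself
    nx = [length(inrange(tx, [x[i]], prevfloat(ϵ[i]))) - 1 for i in 1:n]
    ny = [length(inrange(ty, [y[i]], prevfloat(ϵ[i]))) - 1 for i in 1:n]
    return digamma(k) + digamma(n) - mean(digamma.(nx .+ 1) .+ digamma.(ny .+ 1))
end

If scikit-learn's tree construction or query routine differs (for example, the specialized 1D handling described by Ross), that alone could account for much of the gap.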

@kahaaga
Member

kahaaga commented May 23, 2024

@liuyxpp Also: are you also running some sort of multithreading for the Python code?
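
For reference, the number of Julia threads in the current session can be checked with the one-liner below; comparing it against whatever the Python side uses would make the timings easier to interpret:

Threads.nthreads()  # number of Julia threads in this session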

@kahaaga
Member

kahaaga commented May 24, 2024

It would also be nice to see whether this apparent performance difference also holds for higher-dimensional data. In the Ross paper, which is linked in the Python docs, they use a specialized fast implementation for 1D data. Could you try it out with two large 3D datasets, for example? Something along the lines of the sketch below.
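
A minimal sketch, assuming mutualinfo accepts multivariate StateSpaceSet inputs (the dataset size is just a placeholder):

using CausalityTools, BenchmarkTools

x3 = StateSpaceSet(rand(171522, 3))  # assumption: rows are points in 3D
y3 = StateSpaceSet(rand(171522, 3))

mutualinfo(KSG1(k = 3), x3, y3)           # run once to compile
@btime mutualinfo(KSG1(k = 3), $x3, $y3)  # then benchmark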
