Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zipf: let n have type F #1518

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Zipf: let n have type F #1518

wants to merge 5 commits into from

Conversation

dhardy
Copy link
Member

@dhardy dhardy commented Oct 24, 2024

  • Added a CHANGELOG.md entry

Summary

Change the parameter type of Zipf's n to F

Motivation

#1323 (comment)

Details

The CDF test fails:

---- zipf stdout ----
KS statistic: 0.13359213049244018
Critical value: 0.00195
thread 'zipf' panicked at rand_distr/tests/cdf.rs:119:5:
assertion failed: ks_statistic < critical_value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The value stability tests (only 4 samples; bottom of zipf.rs) did not fail.

@benjamin-lieser
Copy link
Collaborator

You cannot use the continuous test, because the cdf is not continuous. It does work, if you replace
test_continuous(seed as u64, dist, |k| cdf(k, n, x));
with
test_discrete(seed as u64, dist, |k| cdf(k as f64, n, x));

@dhardy dhardy marked this pull request as ready for review October 25, 2024 07:19
@dhardy
Copy link
Member Author

dhardy commented Oct 25, 2024

This should be mostly ready, but do we want to make this change?

Also note: #1517 added a note about casting results to ints being safe as a result of the input bounding the output; the input can now be larger too.

@benjamin-lieser
Copy link
Collaborator

benjamin-lieser commented Oct 25, 2024

I am not sure if we should do it. I do not have some quantitative proof for it, but I doubt that for values bigger u64::MAX there is a measurable difference to the Zeta distribution. So if such big values are desired there is already a solution, albeit less elegant because the user has to do the case distinction.
I guess most users will use Zipf with smaller integers, but to be honest I never really had a need myself, so this is a very vague guess.

Another point might be the name. If you search for Zipf, you fill find mostly the Zeta distribution, scipy has our Zipf as Zipfian. Maybe this would be a better name.

Edit: I guess there is still a difference to Zeta, because even for 2**64 there is significant mass in the tail of the harmonic series. And also Zeta does not support s=1. So there is definitely a potential usecase for this.

@dhardy
Copy link
Member Author

dhardy commented Oct 25, 2024

I'll wait for @vks to comment.

Copy link
Collaborator

@vks vks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I'm a bit confused why there is a test failure.

@@ -385,7 +385,7 @@ fn zipf() {
let parameters = [(1000, 1.0), (500, 2.0), (1000, 0.5)];

for (seed, (n, x)) in parameters.into_iter().enumerate() {
let dist = rand_distr::Zipf::new(n, x).unwrap();
let dist = rand_distr::Zipf::new(n as f64, x).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might as well change the values in parameters to floats.

@benjamin-lieser
Copy link
Collaborator

Looks good, but I'm a bit confused why there is a test failure.

Which test failure?

Do you have an opinion on the name? Zipf vs Zipfian

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants