Unconditional summary stats results on numerical data: margin case #156

alextanski · 2019-04-30T14:35:11Z

Is there any way to get the margin result for a 2D cube (the first dimension being a numerical variable, crossed by a categorical one), i.e. the mean of all cases with non-empty data for both dimensions? I am unable to find anything that works like the cube.measures.scale_means.ScaleMeans.margin() method for numerical data.

Example setup:

Setting up the measure for the mean:

numeric = 'open_realrange'
mean = {
    "function": "cube_mean",
    "args": [
        {
            "function": "cast",
            "args": [
                {
                    "variable": "datasets/{}/variables/{}".format(
                        ds.id,
                        ds[numeric].id
                    )
                },
                {"class": "numeric"}
            ]
        }
    ]
}

Then I use pycrunch.cubes.fetch_cube and the CrunchCube API to query the results from Crunch:

from pycrunch.cubes import fetch_cube, count
from cr.cube.crunch_cube import CrunchCube

crossed_by = 'dropdown'
cube = fetch_cube(ds.resource, [crossed_by], mean=mean)
cube = CrunchCube(cube)

Which gives me:

CrunchCube(name='dropdown', dim_types='CAT')
slices[0]: CubeSlice(name='dropdown', dim_types='CAT', dims='dropdown')
               N
-------  -------
vl:2013  44
vl:2014  33.4167
vl:2015  33

I guess the 1D structure of that cube would cause a margin() result to fail anyway? Which leads to the related question of how I would get unconditional / 1D statistics on numerical data in general: #157

slobodan-ilic · 2019-04-30T16:17:17Z

👀

slobodan-ilic · 2019-05-01T11:09:13Z

After some inspection: please try margin(weighted=False). The cube has no weighted counts, which is why margin() fails, since it assumes weighted=True. This is a bug, but the workaround might unblock you. Please let me know if it works (it should) while I address the other issue properly.

slobodan-ilic · 2019-05-01T11:23:37Z

Additionally, you can try the following:

>>> cube.slices[0]
CubeSlice(name='Shapes of pasta', dim_types='CAT', dims='Shapes of pasta')
                   N
-----------  -------
Bucatini     39.4727
Chitarra     47.873
Boccoli      46.7192
Orecchiette  49.671
Quadrefiore  50.7232
Fileja       38.5867

>>> cube.margin(weighted=False)
array([1658])
>>> cube.margin(weighted=False, include_missing=True)
array([1662])
>>>

because that should include the "missings" of the opposite dimension, and thus be "unconditional".

@malecki can you comment (if I'm right or wrong)?

P.S. - results are from a different dataset than the one you used, lest there be confusion about the numbers...

alextanski · 2019-05-01T11:31:59Z

Thanks for this @slobodan-ilic. I am not at work today, but from what I see above I guess there is a misunderstanding (probably due to my initial phrasing of the issue) about the "margin result": What we are after in this case is the "margin mean", i.e. the mean across all cases that contain valid data for Shapes of pasta. I am pretty sure that the ScaleMeans implementation of margin() shows exactly that.

alextanski · 2019-05-01T11:39:01Z

To add to my comment from above, here is the docstring from the ScaleMeans version:

    def margin(self, axis):
        """Return marginal value of the current slice scaled means.
        This value is the the same what you would get from a single variable
        (constituting a 2D cube/slice), when the "non-missing" filter of the
        opposite variable would be applied. This behavior is consistent with
        what is visible in the front-end client.
        """

We would need a numeric data equivalent here.
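A rough client-side approximation of such a numeric "margin mean" is a count-weighted average of the per-category means. This is only a sketch under assumptions: it presumes you can obtain, by some separate query not shown here, the number of cases with valid numeric data behind each category mean. The function name margin_mean and the counts below are hypothetical; only the three means are taken from the cube output in the first comment.

```python
import math


def margin_mean(cell_means, cell_counts):
    """Count-weighted average of per-category means.

    cell_counts must be the number of cases with valid numeric data in
    each category (an assumption: these are not part of the mean cube).
    Categories with no cases or a NaN mean are skipped.
    """
    total = 0.0
    n = 0.0
    for mean, count in zip(cell_means, cell_counts):
        if count > 0 and not math.isnan(mean):
            total += mean * count
            n += count
    if n == 0:
        return float("nan")
    return total / n


# Means from the 'dropdown' cube above; the counts are made up for illustration.
cell_means = [44.0, 33.4167, 33.0]
cell_counts = [3, 12, 5]
result = margin_mean(cell_means, cell_counts)
print(round(result, 4))
```

For weighted data the same formula would need weighted counts instead of case counts, and the result is only as exact as the rounded cell means that go into it.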

slobodan-ilic · 2019-05-01T13:26:13Z

I don't think scale_means is what we should be looking at here, whether its values or its margin. Scale means are a property of a single categorical variable alone (the numeric values of its categories, combined with counts). The mean is calculated differently on the server side: the server calculates the mean of a different variable (the one you select in the mean measure) and returns those values for each category of the original categorical variable. The resulting cube is presented as "just that" categorical variable, even though the results are actually means of a different variable that you don't explicitly see in the cube result.

So the solution here (and I've confirmed this with @malecki) is to make an additional request to the server for what you want, the same way you'd have to do it in our web client. For the case I've used, it would be something like this (just leave out the crossed_by part):

>>> cube = CrunchCube(fetch_cube(ds, [], mean=mean))
>>> cube
CrunchCube(name='None', dim_types='')
>>> cube.as_array()
array(47.35198556)

slobodan-ilic · 2019-05-01T13:36:28Z

Hm... I've just figured out that the weight argument in fetch_cube behaves a little weirdly. If you apply it, it sets the weight for the entire dataset, and everything from there on is weighted. It even sets the weight for me in the web app, which I wouldn't expect.

alextanski · 2019-05-01T13:41:39Z

Ha! I did not even know that this is possible (I tried to simply fetch an "empty" cube but I think I passed None or something). This is a perfect solution for #157. It does not solve the particular issue outlined in here though as the empty cube mean is not the same as the crossed_by "marginal" one. The former is truly unconditional, the latter should be restricted by the valid data for that dimension.
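The difference between the two means can be illustrated on made-up data. This is a toy sketch in plain Python, not how Crunch computes anything: the rows are invented, and None stands in for a missing value in either variable.

```python
from statistics import mean

# (numeric value, category) pairs; None marks missing data. Invented rows.
rows = [
    (40.0, "2013"),
    (48.0, "2013"),
    (30.0, "2014"),
    (56.0, None),    # numeric is valid, categorical is missing
    (None, "2015"),  # numeric is missing
    (26.0, "2015"),
]

# Truly unconditional: every case with a valid numeric value.
unconditional = mean(v for v, _ in rows if v is not None)

# "Marginal": additionally restricted to cases with a valid category.
marginal = mean(v for v, c in rows if v is not None and c is not None)

# Per-category means, i.e. what the crossed cube reports per row.
by_category = {}
for v, c in rows:
    if v is not None and c is not None:
        by_category.setdefault(c, []).append(v)
category_means = {c: mean(vs) for c, vs in by_category.items()}

print(unconditional, marginal, category_means)
```

The two results differ only because of the case whose category is missing; if the categorical variable has no missing data among the cases with valid numeric values, the unconditional and the marginal mean coincide.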

If this result is simply not obtainable server-side, that is perfectly fine and we should not dig any deeper. This issue came up while looking at the interface from a consistency perspective. #157 (which should be solved by the code above) constituted a real blocker. Thanks @slobodan-ilic and @malecki for looking at that so quickly!

@jamesrkg: Agree to simply close and let both issues rest for now once I have checked against Rogo's deck / dataset?

malecki · 2019-05-03T20:42:10Z

We’re planning some work to improve how numeric variables are dealt with, in particular in multitables where the approach already is the unconditional row variable as the first subcube, followed by whatever other conditioning column variables. It is not currently possible to request the measure cube_mean of numerics via the multitable export endpoint at all, and the first task will be to remedy that.

alextanski mentioned this issue Apr 30, 2019

Unconditional summary stats results on numerical data: explicit 1D case #157

Open

slobodan-ilic closed this as completed May 1, 2019

slobodan-ilic reopened this May 1, 2019
