Consuming items data to df creates inconsistencies with jsonschema #83

Closed
manycoding opened this issue May 8, 2019 · 10 comments · Fixed by #85 or #87
Labels
Priority: High Type: Bug Something isn't working
Comments

@manycoding
Contributor

manycoding commented May 8, 2019

Caused by #75

Pandas makes its own casts, which are incompatible with jsonschema dict validation by default.
For example, given items data
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
or
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
the DataFrame in both cases would be

  _key  availability
0    0           1.0
1    1           NaN
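
A minimal sketch reproducing the cast (plain pandas, not Arche's actual loading code):

import pandas as pd

items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
df = pd.DataFrame(items)
print(df.dtypes)  # availability becomes float64: NaN forces the integer column to float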

Yet in JSON Schema the null type means None, and a missing field is expressed by leaving it out of required. So we have:
Missing field (on purpose)

{
    "required": ["_key"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": "integer"},
    },
    "additionalProperties": False,
}

None field (on purpose)

{
    "required": ["_key", "availability"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": ["integer", "null"]},
    },
    "additionalProperties": False,
}
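
To make the breakage concrete, a minimal sketch (not Arche's actual validation code) of the round trip through a DataFrame failing against the first schema:

import jsonschema
import pandas as pd

schema = {
    "required": ["_key"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": "integer"},
    },
    "additionalProperties": False,
}

df = pd.DataFrame([{"availability": 1, "_key": "0"}, {"_key": "1"}])
for item in df.to_dict("records"):
    jsonschema.validate(item, schema)
# The second item merely lacked "availability" and validates fine as a raw dict,
# but the DataFrame round trip gave it NaN, which fails the "integer" type check.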

Last but not least, the inconsistencies between the JSON schema and the data persist when we feed a DataFrame in directly (unless the user manages it themselves).

@manycoding manycoding added Type: Bug Something isn't working Priority: High labels May 8, 2019
@manycoding manycoding added this to the 0.4.0 milestone May 8, 2019
@manycoding
Contributor Author

I don't see any other choice than to store the dict too.
To make the df compatible with the items data we need to:

  1. Make sure integers are not cast to floats in columns with NaN
  2. Remove NaN for values which are not present in the items data (so we need to check against the dict anyway; see the sketch below for both points)
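
A sketch of both points, assuming pandas >= 0.24 for the nullable Int64 extension dtype:

import pandas as pd

items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
df = pd.DataFrame(items)

# 1. keep integers as integers; the missing entry becomes a null marker, not float64
df["availability"] = df["availability"].astype("Int64")

# 2. drop the null placeholders when converting back to dicts for validation
records = [
    {k: v for k, v in row.items() if pd.notnull(v)}
    for row in df.to_dict("records")
]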

@manycoding
Contributor Author

manycoding commented May 9, 2019

Actually, there are two ways:

  1. Assume NaN means a missing value in most cases - then we can read the first item as a dict, save which fields are integers/floats, and cast the types appropriately (rough sketch below). All the data goes into a df.
  2. Keep the raw data, meaning more memory.
    I'd like to compare speed/memory.
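
A rough sketch of approach 1 (hypothetical code, again assuming the nullable Int64 dtype from pandas >= 0.24):

import pandas as pd

items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]

# remember which fields hold integers in the first item
# (caveat: the first item may not contain every field)
int_fields = {k for k, v in items[0].items() if isinstance(v, int)}

df = pd.DataFrame(items)
for field in int_fields & set(df.columns):
    df[field] = df[field].astype("Int64")  # cast back to nullable integers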

@manycoding
Contributor Author

This can help to save memory if the df-only approach is here to stay: scrapinghub/python-scrapinghub#121

@victor-torres
Contributor

I believe the first thing to do is create a test case to expose the problem.

Then, my first suggestion is to try casting the pandas column right after ingestion.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

@manycoding
Contributor Author

manycoding commented May 9, 2019

@victor-torres The test case was indeed the first thing.
The type can probably be cast, but:

  1. there's no guarantee we find an item with all the fields, so we can easily miss one
  2. casting takes some (insignificant) time

Even getting floats/integers right doesn't solve the NaN issue.
P.S. I updated the description so it's more obvious.

@BurnzZ

BurnzZ commented May 10, 2019

@manycoding I'm not familiar with how jsonschema validates fields with null values, but I found some problems where NaN values validate inconsistently when items read with https://github.com/scrapinghub/python-scrapinghub are placed into a pd.DataFrame.

Suppose we want to check the null-ness of a given column/field. As you have correctly mentioned, pandas casts missing values to NaN. So instinctively we could call np.isnan(df.some_field) to check, since pandas uses np.nan underneath.

However, in some weird cases it returns the wrong boolean value even when there are indeed NaN values in the column.

I found it best to use pd.isnull(df.some_field) instead, which identifies them consistently.

Perhaps if we could swap out jsonschema's validation right from Arche, this might solve the inconsistencies.


EDIT: The same problem happens with the snippet below, even after I've converted the DataFrame into a dict, so using pd.isnull() is still preferred over np.isnan():

for item in df.to_dict("records"):
    cleaned = {k: v for k, v in item.items() if not pd.isnull(v)}
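
For illustration, a small self-contained example of where np.isnan breaks down (the object dtype here stands in for how mixed items data can end up stored):

import numpy as np
import pandas as pd

s = pd.Series([1, None], dtype=object)
print(pd.isnull(s))  # [False, True] - handles both None and NaN
# np.isnan(s) raises TypeError on object dtype, so it cannot be used here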

@ivankivanov
Member

Interesting article about pandas and handling missing data:

https://www.oreilly.com/learning/handling-missing-data

What if, instead of pandas, the data is loaded into numpy arrays? Something like:

import json
import numpy as np

def load_data(path):
    with open(path) as f:
        data = json.load(f)  # a list of item dicts
    return np.asarray(data)  # 1-D object array of dicts, no type coercion

In this way we will have:

  • 1️⃣ empty string - {"city": "Prague", "name": ""} - 'name': ''
  • 2️⃣ missing field - {"city": "Berlin"} - nothing
  • 3️⃣ None - {"city": "Prague", "name": None} - error because None is not valid JSON
  • 4️⃣ null - {"city": "Prague", "name": null} - 'name': None

And on the other hand we have:

  • the pythonic way for storing missing values - None
  • json representation for missing values - null
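
For illustration, assuming a hypothetical items.json with the cases above:

# items.json: [{"city": "Prague", "name": null}, {"city": "Berlin"}]
arr = load_data("items.json")
print(arr[0])  # {'city': 'Prague', 'name': None} - null became None
print(arr[1])  # {'city': 'Berlin'}               - the missing field stays missing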

@manycoding
Contributor Author

manycoding commented May 10, 2019

Perhaps if we could swap out jsonschema's validation right from Arche, this might solve the inconsistencies.

Not right now, since it's a useful tool which cannot be easily replaced.
Removing jsonschema would mean everybody has to check field types and/or regexes (and the other things done with schemas) manually with pandas, or such validation (reading a schema to validate a df) would have to be built on pandas to replace the jsonschema library.

@manycoding
Contributor Author

manycoding commented May 10, 2019

what if instead pandas data is loaded into numpy arrays. Something like:

This one is good, although storing an np.array requires memory; I need to check if it's faster than a dictionary anyway. Surprisingly, I found that storing items data in a dict doesn't add a lot to memory - it was the garbage collector.

One note about JSON null/None: from the jsonschema validation point of view, we don't care about the JSON file itself. It just works.
What we care about is that a field missing in the items data stays missing in the data jsonschema verifies, and that a None field stays None in the data jsonschema verifies. This matters as long as we use jsonschema.

@manycoding
Contributor Author

%memit for Arche.report_all() with schema, 10_000 items:

  • np, list, df (where I am moving to)
    peak memory: 223.23 MiB, increment: 29.59 MiB

  • df (current broken way)
    peak memory: 197.89 MiB, increment: 69.20 MiB

  • df, dict (the old way)
    peak memory: 383.12 MiB, increment: 26.45 MiB
