Consuming items data to df creates inconsistencies with jsonschema #83

Closed
manycoding opened this issue May 8, 2019 · 10 comments · Fixed by #85 or #87
Labels
Priority: High Type: Bug Something isn't working
Comments

@manycoding
Contributor

manycoding commented May 8, 2019

Caused by #75

Pandas makes its own casts, which are incompatible with jsonschema dict validation by default.
For example, given items data
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
or
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
the DataFrame in both cases would be

  _key  availability
0    0           1.0
1    1           NaN
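
A minimal sketch reproducing the cast (plain pandas, not Arche's actual loading code):

import pandas as pd

items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
df = pd.DataFrame(items)
print(df.dtypes)  # availability becomes float64: NaN forces the integer column to float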

Yet in JSON Schema the null type means None, and a missing field is expressed by leaving it out of required. So we have:
Missing field (on purpose)

{
    "required": ["_key"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": "integer"},
    },
    "additionalProperties": False,
}

None field (on purpose)

{
    "required": ["_key", "availability"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": ["integer", "null"]},
    },
    "additionalProperties": False,
}
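
To make the breakage concrete, a minimal sketch (not Arche's actual validation code) of the round trip through a DataFrame failing against the first schema:

import jsonschema
import pandas as pd

schema = {
    "required": ["_key"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": "integer"},
    },
    "additionalProperties": False,
}

df = pd.DataFrame([{"availability": 1, "_key": "0"}, {"_key": "1"}])
for item in df.to_dict("records"):
    jsonschema.validate(item, schema)
# The second item merely lacked "availability" and validates fine as a raw dict,
# but the DataFrame round trip gave it NaN, which fails the "integer" type check.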

Last but not least, the inconsistencies between the JSON schema and the data persist when we feed a DataFrame in directly (unless the user manages it themselves).

@manycoding manycoding added Type: Bug Something isn't working Priority: High labels May 8, 2019
@manycoding manycoding added this to the 0.4.0 milestone May 8, 2019
@manycoding
Contributor Author

I don't see any other choice than to store the dict too.
To make the df compatible with the items data we need to:

  1. Make sure integers are not cast to floats in columns with NaN
  2. Remove NaN for values which are not present in the items data (so we need to check against the dict anyway; see the sketch below for both points)
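
A sketch of both points, assuming pandas >= 0.24 for the nullable Int64 extension dtype:

import pandas as pd

items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
df = pd.DataFrame(items)

# 1. keep integers as integers; the missing entry becomes a null marker, not float64
df["availability"] = df["availability"].astype("Int64")

# 2. drop the null placeholders when converting back to dicts for validation
records = [
    {k: v for k, v in row.items() if pd.notnull(v)}
    for row in df.to_dict("records")
]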

@manycoding
Contributor Author

manycoding commented May 9, 2019

Actually, there are two ways:

  1. Assume NaN means a missing value in most cases - then we can read the first item as a dict, save which fields are integers/floats, and cast the types appropriately (rough sketch below). All the data goes into a df.
  2. Keep the raw data, meaning more memory.
    I'd like to compare speed/memory.
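
A rough sketch of approach 1 (hypothetical code, again assuming the nullable Int64 dtype from pandas >= 0.24):

import pandas as pd

items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]

# remember which fields hold integers in the first item
# (caveat: the first item may not contain every field)
int_fields = {k for k, v in items[0].items() if isinstance(v, int)}

df = pd.DataFrame(items)
for field in int_fields & set(df.columns):
    df[field] = df[field].astype("Int64")  # cast back to nullable integers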

@manycoding
Contributor Author

This can help to save memory if the df-only approach is here to stay: scrapinghub/python-scrapinghub#121

@victor-torres
Contributor

I believe the first thing to do is create a test case to expose the problem.

Then, my first suggestion is to try casting the pandas column right after ingestion.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

@manycoding
Contributor Author

manycoding commented May 9, 2019

@victor-torres The test case was indeed the first thing.
The type can probably be cast, but:

  1. there's no guarantee we find an item with all the fields, so we can easily miss one
  2. casting takes some (insignificant) time

Even getting floats/integers right doesn't solve the NaN issue.
P.S. I updated the description so it's more obvious.

@BurnzZ

BurnzZ commented May 10, 2019

@manycoding I'm not familiar with how jsonschema validates fields with null values, but I found some problems where NaN values validate inconsistently when items read with https://github.com/scrapinghub/python-scrapinghub are placed into a pd.DataFrame.

Suppose we want to check the null-ness of a given column/field. As you have correctly mentioned, pandas casts missing values to NaN. So instinctively we could call np.isnan(df.some_field) to check, since pandas uses np.nan underneath.

However, in some weird cases it returns the wrong boolean value even when there are indeed NaN values in the column.

I found it best to use pd.isnull(df.some_field) instead, which identifies them consistently.

Perhaps if we could swap out jsonschema's validation right from Arche, this might solve the inconsistencies.


EDIT: The same problem happens with the snippet below, even after I've converted the DataFrame into a dict, so using pd.isnull() is still preferred over np.isnan():

for item in df.to_dict("records"):
    cleaned = {k: v for k, v in item.items() if not pd.isnull(v)}
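
For illustration, a small self-contained example of where np.isnan breaks down (the object dtype here stands in for how mixed items data can end up stored):

import numpy as np
import pandas as pd

s = pd.Series([1, None], dtype=object)
print(pd.isnull(s))  # [False, True] - handles both None and NaN
# np.isnan(s) raises TypeError on object dtype, so it cannot be used here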

@ivankivanov
Member

Interesting article about pandas and handling missing data:

https://www.oreilly.com/learning/handling-missing-data

What if, instead of pandas, the data is loaded into numpy arrays? Something like:

import json
import numpy as np

def load_data(path):
    with open(path) as f:
        data = json.load(f)  # a list of item dicts
    return np.asarray(data)  # 1-D object array of dicts, no type coercion

In this way we will have:

  • 1️⃣ empty string - {"city": "Prague", "name": ""} - 'name': ''
  • 2️⃣ missing field - {"city": "Berlin"} - nothing
  • 3️⃣ None - {"city": "Prague", "name": None} - error because None is not valid JSON
  • 4️⃣ null - {"city": "Prague", "name": null} - 'name': None

And on the other hand we have:

  • the pythonic way for storing missing values - None
  • json representation for missing values - null
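
For illustration, assuming a hypothetical items.json with the cases above:

# items.json: [{"city": "Prague", "name": null}, {"city": "Berlin"}]
arr = load_data("items.json")
print(arr[0])  # {'city': 'Prague', 'name': None} - null became None
print(arr[1])  # {'city': 'Berlin'}               - the missing field stays missing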

@manycoding
Contributor Author

manycoding commented May 10, 2019

Perhaps if we could swap out jsonschema's validation right from Arche, this might solve the inconsistencies.

Not right now, since it's a useful tool which cannot be easily replaced.
Removing jsonschema would mean everybody has to check field types and/or regexes (and the other things done with schemas) manually with pandas, or such validation (reading a schema to validate a df) would have to be built on pandas to replace the jsonschema library.

@manycoding
Contributor Author

manycoding commented May 10, 2019

what if instead pandas data is loaded into numpy arrays. Something like:

This one is good, although storing an np.array requires memory; I need to check if it's faster than a dictionary anyway. Surprisingly, I found that storing items data in a dict doesn't add a lot to memory - it was the garbage collector.

One note about JSON null/None: from the jsonschema validation point of view, we don't care about the JSON file itself. It just works.
What we care about is that a field missing in the items data stays missing in the data jsonschema verifies, and that a None field stays None in the data jsonschema verifies. This matters as long as we use jsonschema.

@manycoding
Contributor Author

%memit for Arche.report_all() with schema, 10_000 items:

  • np, list, df (where I am moving to)
    peak memory: 223.23 MiB, increment: 29.59 MiB

  • df (current broken way)
    peak memory: 197.89 MiB, increment: 69.20 MiB

  • df, dict (the old way)
    peak memory: 383.12 MiB, increment: 26.45 MiB
