Consuming items data to df creates inconsistencies with jsonschema #83
Comments
I don't see any other choice as to store
Actually, there are two ways:
This can help save memory if the df-only approach is to stay: scrapinghub/python-scrapinghub#121
I believe the first thing to do is create a test case to expose the problem. Then, my first suggestion is to cast the pandas column right after its ingestion. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
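A minimal sketch of the cast suggested above, assuming the column holds only whole numbers and that pandas' nullable `Int64` dtype is acceptable (that dtype choice is my assumption, not something decided in this thread). It restores integer values, but a missing field and an explicit `None` still end up as the same missing marker:

```python
import pandas as pd

# Sample items taken from the issue description; the second item has no "availability".
items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
df = pd.DataFrame(items)

# pandas infers a float column because of the missing value.
# Casting to the nullable integer dtype keeps 1 as an integer
# and represents the gap as a missing value instead of NaN.
df["availability"] = df["availability"].astype("Int64")

print(df.dtypes)
print(df)
```

This only addresses the float/integer mismatch; it does not recover whether the field was absent or set to `None` in the original item.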
@victor-torres The test case was indeed the first thing.
Even having float\integers right doesn't solve |
@manycoding I'm not familiar with how jsonschema validates fields with Suppose we want to check the However, in some weird cases it would return the incorrect boolean value even when there are indeed I found it best to use Perhaps if we could swap out jsonschema's validation right from Arche, this might solve the inconsistencies. EDIT: The same problem happens to the snippet below, even if I've converted the
Interesting article about pandas and handling missing data: https://www.oreilly.com/learning/handling-missing-data What if the pandas data were instead loaded into numpy arrays? Something like:
in this way we will have:
And on the other hand we have from one side:
Not right now, since it's a useful tool which cannot be easily replaced.
This one is good. Although, storing One note about json null None - from the jsonschema validation point of view, we don't care about the json file. It just works.
Caused by #75
Pandas makes its own casts, which are incompatible with jsonschema dict validation by default.
For example, if the items data is:
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
or
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
DataFrame in both cases would be
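A small sketch to reproduce this, assuming a recent pandas version and using only the two sample inputs above. It checks that both inputs end up with a missing value in the same place, so the frame no longer says whether the field was absent or explicitly `None`:

```python
import pandas as pd

# First input: the second item simply has no "availability" field.
missing_field = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
# Second input: the second item sets "availability" to None on purpose.
none_field = [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]

df_missing = pd.DataFrame(missing_field)
df_none = pd.DataFrame(none_field)

# Compare where each frame reports a missing value.
same_gaps = df_missing["availability"].isna().tolist() == df_none["availability"].isna().tolist()

print(df_missing)
print(df_none)
print("same missing-value pattern:", same_gaps)
```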
Yet the JSON Schema `null` type means `None`, and a missing field is validated by not putting it in `required`. So we have:
- Missing field (on purpose)
- `None` field (on purpose)

Last but not least, the inconsistencies between JSON schema and data persist when we feed a dataframe directly (unless users manage it themselves).
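One way a user could manage this by hand is to strip NaN values from row dicts before validation. The sketch below uses a hypothetical helper, `drop_nan_fields`, which is not part of Arche or jsonschema. It treats every NaN as a missing field, so it cannot bring back an explicit `None`:

```python
import math


def drop_nan_fields(record):
    """Return a copy of record without keys whose value is a float NaN."""
    return {
        key: value
        for key, value in record.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


# Rows as they might look after a round trip through a DataFrame.
rows = [
    {"availability": 1.0, "_key": "0"},
    {"availability": float("nan"), "_key": "1"},
]

cleaned = [drop_nan_fields(row) for row in rows]
print(cleaned)
```

With this approach a field that is not listed in `required` validates again, but a schema that expects `null` for the field would still need the original items.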