Allow exporting items data in CSV format #100

Open
stav opened this issue Jul 12, 2018 · 1 comment

Comments

stav commented Jul 12, 2018

I am using the client to request items, but I would like to limit the fields returned because some of them are far too big. I only need a few fields, but I want all the items. For example, this works fine for CSV if I declare the fields parameter:

$ curl -uAPIKEY: "https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name,venue"

"Ole Miss","Vaught Hemingway Stadium"
"Kansas State","Bill Snyder Family Stadium"
"LSU","Tiger Stadium"

But when I try it with the client I get:

Python 3.6.3 (default, Oct  3 2017, 21:45:48)
>>> import scrapinghub
>>> scrapinghub.__version__
'2.0.3'
>>> client = scrapinghub.ScrapinghubClient(APIKEY)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url:
https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name%2Cvenue

scrapinghub.client.exceptions.ScrapinghubAPIError: No acceptable
content types matching header 'application/x-msgpack' and format 'csv'
The following are supported: application/x-msgpack, application/xml,
text/csv, application/json, application/x-jsonlines

Ok, so let's try without msgpack:

>>> client = scrapinghub.ScrapinghubClient(APIKEY, use_msgpack=False)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

File "scrapinghub/hubstorage/serialization.py", line 25, in jldecode
    yield loads(line)
json.decoder.JSONDecodeError: Extra data: line 1 column 11 (char 10)

So the problem is that the client assumes the response is JSON and tries to decode the string:

'"Ole Miss","Vaught Hemingway Stadium"'

Ok, let's try it with json now:

>>> items = job.items.list(format='json', fields=['name,venue'])
>>> items

[[{'name': 'Ole Miss', 'venue': 'Vaught Hemingway Stadium',
   'venue_address': 'All-American Dr, University, MS 38677, EUA',
   'date': 1542857400000.0,...

Well, there's no error, but we get all the fields instead of just the two we requested; effectively, the fields parameter is ignored.

So maybe we could patch scrapinghub/hubstorage/resourcetype.py:apirequest() to check for the csv format and bypass the JSON decoding, but it would actually be better if the backend API supported this field-subset declaration for other formats, namely JSON.
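In the meantime, one workaround that stays inside the client is to pull the items as plain dicts and do the field limiting and CSV writing locally (a rough sketch, assuming the job object from the session above; field names taken from the example):

import csv
import sys

# Rough workaround sketch: let the client return dicts as usual and do the
# field limiting / CSV writing on our side with the standard library.
fields = ['name', 'venue']
writer = csv.DictWriter(sys.stdout, fieldnames=fields, extrasaction='ignore')
for item in job.items.iter():
    writer.writerow(item)  # extra fields dropped, missing ones left empty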

I see that the API supports max_fields, and we know that CSV supports field limiting, so maybe it's not a big deal to get the API to support field limiting for JSON as well.

@vshlapakov

Thanks for the report, Steven! So there are in fact two issues:

  1. The JSON exporter ignores the fields parameter. Since this should be fixed server-side, I created an internal ticket to address it when we have the resources.
  2. job.items.list doesn't handle exporting data in CSV, and yes, that's something we could improve in the library itself. However, I think a better approach would be to avoid overloading the existing list method and instead add a separate method for this, like job.items.export, which would accept the same set of filters as list and return a string, or maybe dump the data to a file (see the sketch after this list).
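Something along these lines, just to sketch the idea (plain requests; the name, signature and behaviour are illustrative only, not a final API):

import requests

def export_items_csv(apikey, job_key, fields, path=None):
    # Illustrative sketch only: fetch a job's items as CSV with a field
    # subset and either return the text or dump it to a file.
    url = 'https://storage.scrapinghub.com/items/{}'.format(job_key)
    response = requests.get(url, auth=(apikey, ''),
                            params={'format': 'csv', 'fields': ','.join(fields)})
    response.raise_for_status()
    if path is not None:
        with open(path, 'w') as output:
            output.write(response.text)
        return path
    return response.text

# e.g. export_items_csv(APIKEY, '244066/83/3', ['name', 'venue'], path='items.csv')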

@vshlapakov vshlapakov changed the title Cannot get a limited fields subset Allow to export items data in CSV format Sep 17, 2018
@stav stav changed the title Allow to export items data in CSV format Allow exporting items data in CSV format Sep 17, 2018