msgpack errors when using iter() with intervals between each batch call #121
Comments
Hey Kevin, thanks for the report! Short question: have you used the latest 2.1.1 version? There was a hidden bug up to that version which could lead to wrong iteration behavior.
If yes, there's something else you could try. In the current implementation you iterate through all items at once, and assuming the amount of items is huge and processing the data takes time, it's possible to hit a timeout. But since you know the desired chunk size beforehand and it's large enough, you could send one request per chunk using the pagination parameters and handle the data with pandas for as long as you need, something like the sketch below.
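A minimal sketch of that pagination approach, assuming the python-scrapinghub client with a placeholder API key and job key:

```python
import pandas as pd
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('APIKEY')      # placeholder API key
job = client.get_job('123456/1/1')        # placeholder job key

chunk_size = 10000
offset = 0
while True:
    # One request per chunk via pagination params (count/start), so no single
    # connection has to stay open while a chunk is being processed.
    items = job.items.list(count=chunk_size,
                           start='{}/{}'.format(job.key, offset))
    if not items:
        break
    df = pd.DataFrame(items)
    # ...process the chunk with pandas here...
    offset += len(items)
```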
Thanks @vshlapakov! The project was using an older version, so I'll try upgrading to 2.1.1 first. Otherwise, I'll try the pagination suggestion you've introduced. Cheers!
Hi @vshlapakov, reporting in that version 2.1.1 still produces the same errors on our end.
Gotcha, thanks for the update! Let me know when you test the approach 👍
Hi @vshlapakov, I've opened PR #133 based on your suggestion. I think having this convenience method would be really helpful in cases where we're processing a large number of items. @manycoding, I see that this might also be of use to arche, given your issue in scrapinghub/arche#140. Thanks!
Would it not be nicer to have it as a default (at some point) behavior in the normal iter()? Something like:

```python
def iter(..., buffer: Optional[int] = None, in_chunks: bool = False):
    if not buffer:
        ...  # proceed as usual
        return
    for chunk in self._list_iter(...):
        if in_chunks:
            # for those actually needing chunks
            yield chunk
        else:
            yield from chunk
```
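For illustration, calls under that hypothetical signature might look like the following (buffer and in_chunks are names from the sketch above, not an existing API; process and process_chunk are placeholder handlers):

```python
# Hypothetical usage of the proposed flags -- not part of the current API.

# Default: behaves exactly like today's iter(), yielding individual items.
for item in job.items.iter():
    process(item)

# Buffered fetching, still yielding individual items.
for item in job.items.iter(buffer=10000):
    process(item)

# Buffered fetching, yielding whole chunks (lists of items).
for chunk in job.items.iter(buffer=10000, in_chunks=True):
    process_chunk(chunk)
```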
I do get where you're coming from @hermit-crab, though I think keeping it as a separate method is clearer, since the two calls generate different data structures (individual items vs. lists of items).
Thank you for the reply @BurnzZ. Yes, I understand they will generate different structures, but to elaborate on what I mean: I believe this solution creates a situation where you have two distinct methods which do roughly the same thing (retrieving a resource of a job), with one of them being clearly preferable over the other despite a slightly different and less commonly needed output format of the same data. At that point, why would you ever use the plain iter()? For instance, for the issue mentioned above (scrapinghub/shub-workflow#5), the solution will end up being something like this:

```python
def _process_job_items(self, scrapername, spider_job):
    first_keyprefix = None
    items_gen = (item for chunk in spider_job.items.iter_in_chunks() for item in chunk)
    # or any other variation of flattening a list
    for item in items_gen:
        ...
```

Whereas I think it would be nicer to just provide a flag to iter() itself. That would be similar in style to how pandas allows io/memory-efficient reads with the chunksize parameter on its read_* functions.
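For reference, a minimal sketch of the pandas pattern mentioned above, assuming a local items.csv file and a hypothetical process() helper:

```python
import pandas as pd

def process(df):
    """Hypothetical per-chunk handler."""
    print(len(df))

# With chunksize set, read_csv returns an iterator of DataFrames instead of
# loading the whole file into memory at once -- the same opt-in pattern
# proposed for iter() above.
for chunk_df in pd.read_csv('items.csv', chunksize=10000):
    process(chunk_df)
```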
@hermit-crab That makes sense to me; the methods are very close to each other, and the only major difference is the output format. I'm going to close this issue as the original problem is solved, but I'm looking forward to improving it if possible, once we agree on the implementation.
Hi @BurnzZ, @hermit-crab, @vshlapakov, @Gallaecio, I have observed a (rare and random) problem when iterating items in chunks this way. In the urllib3 debug log you can see the API call it makes in the backend; worth mentioning is that even though count=1000, the start value (435191/897/44/35697000) is huge, as the job is processing around 43M items in chunks of 1000. Would converting the iterator to a list help solve this issue? Let me know if I should open up an issue for this or if you need more input from me. Thanks.
Good Day!
I've encountered this peculiar issue when trying to save memory by processing the items in chunks. Here's a stripped-down version of the code for reproducing the issue:
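A rough sketch of the pattern being described (buffering items from iter() into large chunks, with slow pandas processing between fetches), assuming a placeholder API key and job key:

```python
import pandas as pd
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('APIKEY')   # placeholder API key
job = client.get_job('123456/1/1')     # placeholder job key

chunk = 10000                          # larger values trigger the errors
buffer = []
for item in job.items.iter():
    buffer.append(item)
    if len(buffer) >= chunk:
        df = pd.DataFrame(buffer)
        # ...slow processing here, leaving a long gap before the next
        # batch of items is pulled from the open iter() connection...
        buffer = []
```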
Here's the common error it throws:
Moreover, it throws a different error when using a much bigger chunk size, like 50000:
I find that the workaround/solution for this would be to use a lower value for `chunk`; so far, 1000 works great. This uses the `scrapy:1.5` stack in Scrapy Cloud.

I'm guessing this might have something to do with the long waiting time that happens when processing the pandas DataFrame chunk: by the time the next batch of items is being iterated, the server might have deallocated the pointer to it or something.

May I ask if there might be a solution for this, since a much bigger `chunk` size would help with the speed of our jobs? I've marked it as a bug for now, as this is quite unexpected/undocumented behavior.
Cheers!