HcfMiddleware consumes too much memory on a large number of URLs #56
Comments
I'll be working on a PR for solving this issue.

I don't think your memory issues come from this middleware.

Sorry, I mistakenly clicked the "Close and comment" button just now. :( Thanks for the reply @nramirezuy :)

What do you mean by "hacked"? What are the changes made to the spider? I would suggest not removing

Setting

Sure. This is exactly what I do in PR #57 :)
How to reproduce this issue
Write a simple spider that generates a large number of `Request`s and stores them in HCF. For example, read some site's sitemap and generate several million or more requests from it. The `HcfMiddleware` will then consume a really large amount of memory.
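For concreteness, a minimal sketch of such a spider (the sitemap URL is a placeholder, and the `use_hcf` meta flag is assumed to be what routes requests through the middleware to HCF):

```python
import scrapy


class SitemapToHcfSpider(scrapy.Spider):
    """Reproduction sketch: turn every sitemap entry into an HCF request."""

    name = "sitemap_to_hcf"
    # Placeholder sitemap; any sitemap with millions of entries will do.
    start_urls = ["https://example.com/sitemap.xml"]

    def parse(self, response):
        # Strip the sitemap XML namespace so a plain //loc XPath works.
        response.selector.remove_namespaces()
        for url in response.xpath("//loc/text()").extract():
            # use_hcf is assumed to be the meta flag that sends a request
            # through the HcfMiddleware into the Hub Crawl Frontier.
            yield scrapy.Request(url, callback=self.parse_page,
                                 meta={"use_hcf": True})

    def parse_page(self, response):
        # The memory growth shows up while the requests are being generated,
        # before any of these pages are actually parsed.
        pass
```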
Cause of the issue
The current version of `HcfMiddleware`'s source code can be accessed here; the relevant code is at line 159 and line 167. From it we can see that the middleware maintains a duplicate filter in memory, which increases the memory overhead.
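Roughly, the problematic pattern looks like this (a simplified illustration only, not the middleware's actual code): every link handed to HCF is also remembered in an in-memory set, so memory grows linearly with the number of unique URLs.

```python
from collections import defaultdict


class InMemoryLinkFilter:
    """Simplified illustration of the pattern, not the real HcfMiddleware."""

    def __init__(self):
        # One ever-growing set of seen links per HCF slot.
        self.new_links = defaultdict(set)

    def add_link(self, slot, fingerprint):
        if fingerprint in self.new_links[slot]:
            return False  # duplicate, dropped locally
        # Kept for the whole crawl: millions of URLs means gigabytes of RAM.
        self.new_links[slot].add(fingerprint)
        return True
```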
How to solve this issue
As the middleware no longer relies on `new_links` to upload URLs (it uses the batch uploader instead), and the `new_links` attribute serves only two purposes, it is suggested to add an `hcf_dont_filter` key to `request.meta` and ignore the duplicate filter when the key is set to `True`.
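A sketch of the proposed behaviour (names are illustrative, not the final PR code): requests carrying `hcf_dont_filter=True` in their meta would bypass the in-memory filter entirely, so nothing gets added to `new_links` for them.

```python
from collections import defaultdict


class LinkFilterWithOptOut:
    """Sketch of the proposal; attribute and method names are illustrative."""

    def __init__(self):
        self.new_links = defaultdict(set)  # current in-memory duplicate filter

    def accept(self, request, slot, fingerprint):
        if request.meta.get("hcf_dont_filter", False):
            # Opted-out requests never touch new_links, so memory stays flat
            # no matter how many URLs the spider generates.
            return True
        # Default path: keep filtering duplicates locally, as today.
        if fingerprint in self.new_links[slot]:
            return False  # duplicate, drop it
        self.new_links[slot].add(fingerprint)
        return True
```

A spider would then opt out per request with something like `meta={"use_hcf": True, "hcf_dont_filter": True}`, leaving deduplication to HCF itself rather than the local set.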