This repository has been archived by the owner on Jul 19, 2018. It is now read-only.

HcfMiddleware consumes too much memory on a large number of URLs #56

Open
starrify opened this issue Mar 25, 2015 · 5 comments

Comments

@starrify
Member

How to reproduce this issue

Write a simple spider to generate a large number of Requests and store them to HCF. For example, read some site's sitemap and generate several million or more requests from it.
The HcfMiddleware will then consume a really large amount of memory.
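
For illustration, a minimal spider along these lines reproduces it (a sketch only: the sitemap URL is a placeholder, and use_hcf is the meta key HcfMiddleware uses to route requests to HCF):

    # Hypothetical reproduction spider: it emits millions of requests flagged
    # for HCF, so HcfMiddleware's in-memory new_links set grows with every
    # unique URL it sees.
    import scrapy

    class SitemapToHcfSpider(scrapy.Spider):
        name = "sitemap_to_hcf"
        start_urls = ["https://example.com/sitemap.xml"]  # placeholder sitemap

        def parse(self, response):
            # Assume the sitemap expands to millions of <loc> entries.
            for url in response.xpath("//*[local-name()='loc']/text()").extract():
                yield scrapy.Request(url, meta={"use_hcf": True})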

Cause of the issue

The current version of HcfMiddleware's source code can be accessed here. The relevant code is at lines 159 and 167:

159: if not request.url in self.new_links[slot]:
         ...
167:     self.new_links[slot].add(request.url)

From this we can see that the middleware maintains an in-memory duplicate filter, which is what causes the memory overhead.
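
As a rough, back-of-the-envelope illustration (the numbers are assumptions, not measurements; actual figures depend on the Python build and URL lengths):

    # Rough estimate of what a per-slot set of ~10 million URL strings costs.
    import sys

    n_urls = 10 * 10**6
    sample_url = "https://example.com/some/fairly/typical/product/page-12345"

    # Cost per entry: the string object itself plus roughly 30-60 bytes of
    # set/hash-table overhead per element (an assumed ballpark for CPython).
    per_string = sys.getsizeof(sample_url)
    per_slot_overhead = 40

    total_gb = n_urls * (per_string + per_slot_overhead) / 1024.0 ** 3
    print("new_links would hold roughly %.1f GB for %d URLs" % (total_gb, n_urls))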

How to solve this issue

Since the middleware no longer relies on new_links to upload URLs (it uses the batch uploader instead), the new_links attribute now serves only two purposes:

  1. To maintain a duplicate filter (here)
  2. To report the total number of links stored (here)

It's suggested to add an hcf_dont_filter key to request.meta and skip the duplicate filter when that key is set to True; see the sketch below.
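
A minimal sketch of what that could look like inside the middleware (simplified: slot selection and the batch-uploader call are represented by a hypothetical _enqueue_to_hcf helper):

    def _store_request(self, request, slot):
        if request.meta.get("hcf_dont_filter", False):
            # The caller takes responsibility for duplicates, so new_links
            # is never touched and does not grow.
            self._enqueue_to_hcf(request, slot)  # hypothetical helper
            return
        if request.url not in self.new_links[slot]:
            self._enqueue_to_hcf(request, slot)  # hypothetical helper
            self.new_links[slot].add(request.url)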
@starrify
Member Author

I'll be working on a PR for solving this issue.

starrify added a commit to starrify/scrapylib that referenced this issue Mar 25, 2015
@nramirezuy
Contributor

I don't think your memory issues come from this middleware.

  1. Are you using SitemapSpider along with this middleware?
  2. Are you reading requests as you upload them with the same spider?

@starrify reopened this Mar 25, 2015
@starrify
Member Author

Sorry, I mistakenly clicked on the "Close and comment" button just now... :(

Thanks for the reply @nramirezuy :)

  1. Yes. (But hacked into its _parse_sitemap and _parse_urlset methods)
  2. No.

I'm working with a hacked SitemapSpider; a typical job sends ~500 requests and gets ~10 million URLs to store to HCF. Wouldn't that put ~10 million URLs into HcfMiddleware.new_links? What I want is to eliminate this overhead. :)

@nramirezuy
Contributor

What do you mean by hacked? What changes were made to the spider? SitemapSpider is extremely memory-consuming because of big sitemap files, which imply massive responses.

I would suggest not removing new_links, but changing it to use the same technique the Scheduler uses, and making it optional via a setting.

@starrify
Member Author

What do you mean by hacked? What changes were made to the spider? SitemapSpider is extremely memory-consuming because of big sitemap files, which imply massive responses.

Setting request.meta["use_hcf"] = True for requests that come from the sitemap. Thus there won't be massive responses, only massive URLs to be stored to HCF.

I would suggest not removing new_links, but changing it to use the same technique the Scheduler uses, and making it optional via a setting.

Sure. This is exactly what I do in PR #57 :)
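
For context, "the same technique Scheduler uses" is Scrapy's dupefilter approach of keeping fixed-size request fingerprints rather than full URL strings. A hedged sketch of that idea, gated by a hypothetical HCF_REQUEST_FILTER setting (the real change is whatever landed in PR #57, not this code):

    from collections import defaultdict

    from scrapy.utils.request import request_fingerprint

    class HcfLinkFilter(object):
        """Optional, fingerprint-based duplicate filter (illustrative only)."""

        def __init__(self, settings):
            self.enabled = settings.getbool("HCF_REQUEST_FILTER", True)  # assumed setting name
            self.seen = defaultdict(set)  # slot -> set of request fingerprints

        def is_new(self, slot, request):
            if not self.enabled or request.meta.get("hcf_dont_filter"):
                return True
            fp = request_fingerprint(request)  # fixed-size hash, not the full URL
            if fp in self.seen[slot]:
                return False
            self.seen[slot].add(fp)
            return True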
