This repository has been archived by the owner on Jul 19, 2018. It is now read-only.

HcfMiddleware consumes too much memory on a large number of URLs #56

Open
starrify opened this issue Mar 25, 2015 · 5 comments

Comments

@starrify
Member

How to reproduce this issue

Write a simple spider to generate a large number of Requests and store them to HCF. For example, read some site's sitemap and generate several million or more requests from it.
The HcfMiddleware will then consume a really large amount of memory.
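
For illustration, a minimal spider along these lines reproduces it (a sketch only: the sitemap URL is a placeholder, and use_hcf is the meta key HcfMiddleware uses to route requests to HCF):

    # Hypothetical reproduction spider: it emits millions of requests flagged
    # for HCF, so HcfMiddleware's in-memory new_links set grows with every
    # unique URL it sees.
    import scrapy

    class SitemapToHcfSpider(scrapy.Spider):
        name = "sitemap_to_hcf"
        start_urls = ["https://example.com/sitemap.xml"]  # placeholder sitemap

        def parse(self, response):
            # Assume the sitemap expands to millions of <loc> entries.
            for url in response.xpath("//*[local-name()='loc']/text()").extract():
                yield scrapy.Request(url, meta={"use_hcf": True})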

Cause of the issue

The current version of HcfMiddleware's source code can be accessed here. The relevant code is at lines 159 and 167:

159: if not request.url in self.new_links[slot]:
         ...
167:     self.new_links[slot].add(request.url)

From this we can see that the middleware maintains an in-memory duplicate filter, which is what causes the memory overhead.
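
As a rough, back-of-the-envelope illustration (the numbers are assumptions, not measurements; actual figures depend on the Python build and URL lengths):

    # Rough estimate of what a per-slot set of ~10 million URL strings costs.
    import sys

    n_urls = 10 * 10**6
    sample_url = "https://example.com/some/fairly/typical/product/page-12345"

    # Cost per entry: the string object itself plus roughly 30-60 bytes of
    # set/hash-table overhead per element (an assumed ballpark for CPython).
    per_string = sys.getsizeof(sample_url)
    per_slot_overhead = 40

    total_gb = n_urls * (per_string + per_slot_overhead) / 1024.0 ** 3
    print("new_links would hold roughly %.1f GB for %d URLs" % (total_gb, n_urls))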

How to solve this issue

Since the middleware no longer relies on new_links to upload URLs (it uses the batch uploader instead), the new_links attribute now serves only two purposes:

  1. To maintain a duplicate filter (here)
  2. To report the total number of links stored (here)

It's suggested to add an hcf_dont_filter key to request.meta and skip the duplicate filter when that key is set to True; see the sketch below.
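
A minimal sketch of what that could look like inside the middleware (simplified: slot selection and the batch-uploader call are represented by a hypothetical _enqueue_to_hcf helper):

    def _store_request(self, request, slot):
        if request.meta.get("hcf_dont_filter", False):
            # The caller takes responsibility for duplicates, so new_links
            # is never touched and does not grow.
            self._enqueue_to_hcf(request, slot)  # hypothetical helper
            return
        if request.url not in self.new_links[slot]:
            self._enqueue_to_hcf(request, slot)  # hypothetical helper
            self.new_links[slot].add(request.url)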
@starrify
Member Author

I'll be working on a PR for solving this issue.

starrify added a commit to starrify/scrapylib that referenced this issue Mar 25, 2015
@nramirezuy
Contributor

I don't think your memory issues come from this middleware.

  1. Are you using SitemapSpider along with this middleware?
  2. Are you reading requests as you upload them with the same spider?

@starrify reopened this Mar 25, 2015
@starrify
Member Author

Sorry, I mistakenly clicked on the "Close and comment" button just now... :(

Thanks for the reply @nramirezuy :)

  1. Yes. (But hacked into its _parse_sitemap and _parse_urlset methods)
  2. No.

I'm working with a hacked SitemapSpider; a typical job sends ~500 requests and gets ~10 million URLs to store to HCF. Wouldn't that put ~10 million URLs into HcfMiddleware.new_links? What I want is to eliminate this overhead. :)

@nramirezuy
Contributor

What do you mean by hacked? What changes were made to the spider? SitemapSpider is extremely memory-consuming because of big sitemap files, which imply massive responses.

I would suggest not removing new_links, but changing it to use the same technique the Scheduler uses, and making it optional via a setting.

@starrify
Member Author

What do you mean by hacked? What changes were made to the spider? SitemapSpider is extremely memory-consuming because of big sitemap files, which imply massive responses.

Setting request.meta["use_hcf"] = True for requests that come from the sitemap. Thus there won't be massive responses, only massive URLs to be stored to HCF.

I would suggest not removing new_links, but changing it to use the same technique the Scheduler uses, and making it optional via a setting.

Sure. This is exactly what I do in PR #57 :)
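
For context, "the same technique Scheduler uses" is Scrapy's dupefilter approach of keeping fixed-size request fingerprints rather than full URL strings. A hedged sketch of that idea, gated by a hypothetical HCF_REQUEST_FILTER setting (the real change is whatever landed in PR #57, not this code):

    from collections import defaultdict

    from scrapy.utils.request import request_fingerprint

    class HcfLinkFilter(object):
        """Optional, fingerprint-based duplicate filter (illustrative only)."""

        def __init__(self, settings):
            self.enabled = settings.getbool("HCF_REQUEST_FILTER", True)  # assumed setting name
            self.seen = defaultdict(set)  # slot -> set of request fingerprints

        def is_new(self, slot, request):
            if not self.enabled or request.meta.get("hcf_dont_filter"):
                return True
            fp = request_fingerprint(request)  # fixed-size hash, not the full URL
            if fp in self.seen[slot]:
                return False
            self.seen[slot].add(fp)
            return True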
