scrapy-examples

Assorted Scrapy examples with integrated proxies and agents, which make it comfortable to write a spider.

Don't use it to do anything illegal!

##Real spider example: doubanbook

####Tutorial

git clone https://github.com/geekan/scrapy-examples
cd scrapy-examples/doubanbook
scrapy crawl doubanbook

####Depth

The spider crawls at several depths and extracts the real data at depth 2.

- Depth 0: The entrance is http://book.douban.com/tag/
- Depth 1: URLs like http://book.douban.com/tag/外国文学, found at depth 0
- Depth 2: URLs like http://book.douban.com/subject/1770782/, found at depth 1
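
The three depths can be told apart by URL shape alone. Here is a minimal sketch, assuming the example URLs above are representative; `depth_of` and `DEPTH_PATTERNS` are hypothetical names for illustration and are not part of the repository:

```python
import re

# Hypothetical patterns derived from the three example URLs above.
DEPTH_PATTERNS = [
    (0, re.compile(r"^https?://book\.douban\.com/tag/?$")),
    (1, re.compile(r"^https?://book\.douban\.com/tag/[^/]+/?$")),
    (2, re.compile(r"^https?://book\.douban\.com/subject/\d+/?$")),
]


def depth_of(url):
    """Return the crawl depth a URL belongs to, or None if no pattern matches."""
    for depth, pattern in DEPTH_PATTERNS:
        if pattern.match(url):
            return depth
    return None
```

Only URLs for which `depth_of` returns 2 would be book pages carrying real data; the other two depths only supply links to follow.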

##Available Spiders

- tutorial
  - dmoz_item
  - douban_book
  - page_recorder
  - douban_tag_book
- doubanbook
- linkedin
- hrtencent
- sis
- zhihu
- alexa
  - alexa
  - alexa.cn

##Advanced

- Use `parse_with_rules` to write a spider quickly. See the dmoz spider for more details.
- Proxies
  - If you don't want to use a proxy, comment out the proxy middleware in settings.
  - If you want to customize it, edit misc/proxy.py yourself.
- Notice
  - Don't use `parse` as your method name; it is an internal method of CrawlSpider.
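
Disabling the proxy might look roughly like the fragment below. `DOWNLOADER_MIDDLEWARES` is Scrapy's standard setting for downloader middlewares, but the two middleware paths are hypothetical placeholders; check the project's own settings.py for the real entries:

```python
# settings.py (fragment). Both middleware paths are hypothetical
# placeholders, not the names used in this repository.
DOWNLOADER_MIDDLEWARES = {
    # Commented out, so Scrapy never loads the proxy middleware:
    # "misc.middleware.ExampleProxyMiddleware": 400,
    "misc.middleware.ExampleAgentMiddleware": 401,
}
```

With the proxy entry commented out, requests go out directly while the remaining middleware keeps working.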

####Advanced Usage

Run `./startproject.sh <PROJECT>` to start a new project.
It generates most of the project automatically; the only files left for you to write are:
- PROJECT/PROJECT/items.py
- PROJECT/PROJECT/spider/spider.py

####Example to hack `items.py` and `spider.py`

Hacked items.py with additional fields url and description:

from scrapy.item import Item, Field

class exampleItem(Item):
    url = Field()
    name = Field()
    description = Field()

Hacked spider.py with start rules and css rules (only the class exampleSpider is shown here):

class exampleSpider(CommonSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]
    # The crawler starts at start_urls and follows the URLs allowed by the rules below.
    # The callback is not named parse, because CrawlSpider uses parse internally.
    rules = [
        Rule(sle(allow=["/Arts/", "/Games/"]), callback='parse_page', follow=True),
    ]

    css_rules = {
        '.directory-url li': {
            '__use': 'dump', # dump data directly
            '__list': True, # it's a list
            'url': 'li > a::attr(href)',
            'name': 'a::text',
            'description': 'li::text',
        }
    }

    def parse_page(self, response):
        info('Parse '+response.url)
        # parse_with_rules is implemented here:
        #   https://github.com/geekan/scrapy-examples/blob/master/misc/spider.py
        self.parse_with_rules(response, self.css_rules, exampleItem)
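
To see which links the rule above would follow, here is a minimal sketch of the `allow` check, assuming each pattern is treated as a regular expression searched anywhere in the URL (which is how Scrapy's link extractor documents `allow`). `is_allowed` and the sample URLs are hypothetical, and the real extractor also applies other filters such as `allowed_domains`:

```python
import re

ALLOW = ["/Arts/", "/Games/"]  # same patterns as in the rule above


def is_allowed(url, patterns=ALLOW):
    """Return True if any pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical sample URLs, for illustration only.
for url in ("http://www.dmoz.org/Arts/Music/", "http://www.dmoz.org/Science/"):
    print(url, is_allowed(url))
```

Because the patterns are searched rather than anchored, `/Arts/` matches at any position in the path, not only at the start.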
