1. Debugging Spiders
2. Common Practices
3. Broad Crawls
6. AutoThrottle extension

This article aims to give a detailed introduction to Scrapy; it is mostly translated from the official Scrapy documentation.
It was written on September 15, 2022. The platform is Windows 10 and the editor is VS Code.


1.Debugging Spiders

(1) The scrapy parse command.
(2) The scrapy.shell module; example usage:

from scrapy.shell import inspect_response

def parse_details(self, response, item=None):
    if item:
        # populate more `item` fields
        return item
    else:
        inspect_response(response, self)

(3) The scrapy.utils.response module, which can open the response in a browser:

from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    if b"item name" not in response.body:
        open_in_browser(response)

(5) Spiders Contracts.

2.Common Practices

(1) Running spiders by means other than scrapy crawl, for example running several spiders at once, or all of them:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiderloader import SpiderLoader

# get a CrawlerProcess instance from the project settings
process = CrawlerProcess(get_project_settings())

# get a SpiderLoader object, which can list all spider names in the project
spider_loader = SpiderLoader(get_project_settings())

# add all spiders
for spidername in spider_loader.list():
    process.crawl(spidername)

# or add a single spider:
# process.crawl('somespider')

# run
process.start()

(3) Avoiding IP bans. The original doc's suggestions are:

rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)

disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour

use download delays (2 or higher). See DOWNLOAD_DELAY setting.

if possible, use Google cache to fetch pages, instead of hitting the sites directly

use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.

use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Zyte Smart Proxy Manager
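A couple of these suggestions map directly to settings; a minimal sketch (the values here are illustrative, not prescribed by the docs):

```python
# illustrative politeness settings from the suggestions above
COOKIES_ENABLED = False  # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2       # seconds of delay between requests to the same site
```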

3.Broad Crawls


Scrapy defaults are optimized for crawling specific sites. These sites are often handled by a single Scrapy spider, although this is not necessary or required (for example, there are generic spiders that handle any given site thrown at them).

In addition to this “focused crawl”, there is another common type of crawling which covers a large (potentially unlimited) number of domains, and is only limited by time or another arbitrary constraint, rather than stopping when the domain has been crawled to completion or when there are no more requests to perform. These are called “broad crawls”, and they are the typical crawls employed by search engines.

These are some common properties often found in broad crawls:

they crawl many domains (often, unbounded) instead of a specific set of sites

they don’t necessarily crawl domains to completion, because it would be impractical (or impossible) to do so, and instead limit the crawl by time or number of pages crawled

they are simpler in logic (as opposed to very complex spiders with many extraction rules) because data is often post-processed in a separate stage

they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited by any particular site constraint (each site is crawled slowly to respect politeness, but many sites are crawled in parallel)

As said above, Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.

Scrapy’s default scheduler priority queue is 'scrapy.pqueues.ScrapyPriorityQueue'. It works best during single-domain crawls. It does not work well when crawling many different domains in parallel.

To apply the recommended priority queue use:

SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
Increase concurrency¶
Concurrency is the number of requests that are processed in parallel. There is a global limit (CONCURRENT_REQUESTS) and an additional limit that can be set either per domain (CONCURRENT_REQUESTS_PER_DOMAIN) or per IP (CONCURRENT_REQUESTS_PER_IP).


The scheduler priority queue recommended for broad crawls does not support CONCURRENT_REQUESTS_PER_IP.

The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it depends on how much CPU and memory your crawler will have available.

A good starting point is 100:
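In settings.py (per the official Broad Crawls page):

```python
CONCURRENT_REQUESTS = 100
```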

But the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process becomes CPU-bound. For optimal performance, pick a concurrency at which CPU usage is around 80-90%.

Increasing concurrency also increases memory usage. If memory usage is a concern, you might need to lower your global concurrency limit accordingly.

Increase Twisted IO thread pool maximum size¶
Currently Scrapy does DNS resolution in a blocking way, using a thread pool. With higher concurrency levels the crawl can become slow or even fail with DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries, so the DNS queue is processed faster, speeding up connection establishment and the crawl overall.

To increase maximum thread pool size use:
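Per the official docs, in settings.py:

```python
REACTOR_THREADPOOL_MAXSIZE = 20
```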

Setup your own DNS¶
If you have multiple crawling processes and a single central DNS server, it can act like a DoS attack on the DNS server, resulting in a slowdown of the entire network or even blocking your machines. To avoid this, set up your own DNS server with a local cache and an upstream to a large DNS provider such as OpenDNS or Verizon.

Reduce log level¶
When doing broad crawls you are often only interested in the crawl rates you get and in any errors found. These stats are reported by Scrapy when using the INFO log level. In order to save CPU (and log storage), you should not use the DEBUG log level when performing large broad crawls in production. Using the DEBUG level while developing your (broad) crawler is fine, though.

To set the log level use:
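In settings.py:

```python
LOG_LEVEL = 'INFO'
```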

Disable cookies¶
Disable cookies unless you really need them. Cookies are often not needed when doing broad crawls (search engine crawlers ignore them), and disabling them improves performance by saving some CPU cycles and reducing the memory footprint of your Scrapy crawler.

To disable cookies use:
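In settings.py:

```python
COOKIES_ENABLED = False
```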

Disable retries¶
Retrying failed HTTP requests can slow down a crawl substantially, especially when sites are very slow to respond (or fail to respond at all), causing timeout errors that get retried many times and unnecessarily prevent crawler capacity from being reused for other domains.

To disable retries use:
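In settings.py:

```python
RETRY_ENABLED = False
```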

Reduce download timeout¶
Unless you are crawling from a very slow connection (which shouldn’t be the case for broad crawls) reduce the download timeout so that stuck requests are discarded quickly and free up capacity to process the next ones.

To reduce the download timeout use:
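Per the official docs (the value is in seconds):

```python
DOWNLOAD_TIMEOUT = 15
```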

Disable redirects¶
Consider disabling redirects, unless you are interested in following them. When doing broad crawls it’s common to save redirects and resolve them when revisiting the site in a later crawl. This also helps to keep the number of requests per crawl batch constant; otherwise redirect loops may cause the crawler to dedicate too many resources to a specific domain.

To disable redirects use:
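In settings.py:

```python
REDIRECT_ENABLED = False
```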

Enable crawling of “Ajax Crawlable Pages”¶
Some pages (up to 1%, based on empirical data from the year 2013) declare themselves as AJAX crawlable. This means they provide a plain HTML version of content that is usually available only via AJAX. Pages can indicate this in two ways:

by using #! in URL - this is the default way;

by using a special meta tag - this way is used on “main”, “index” website pages.

Scrapy handles (1) automatically; to handle (2) enable AjaxCrawlMiddleware:
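In settings.py:

```python
AJAXCRAWL_ENABLED = True
```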

When doing broad crawls it’s common to crawl a lot of “index” web pages; AjaxCrawlMiddleware helps to crawl them correctly. It is turned OFF by default because it has some performance overhead, and enabling it for focused crawls doesn’t make much sense.

Crawl in BFO order¶
Scrapy crawls in DFO order by default.

In broad crawls, however, page crawling tends to be faster than page processing. As a result, unprocessed early requests stay in memory until the final depth is reached, which can significantly increase memory usage.

Crawl in BFO order instead to save memory.
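Per the official docs, BFO order is enabled with:

```python
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```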

Be mindful of memory leaks¶
If your broad crawl shows a high memory usage, in addition to crawling in BFO order and lowering concurrency you should debug your memory leaks.

Install a specific Twisted reactor¶
If the crawl is exceeding the system’s capabilities, you might want to try installing a specific Twisted reactor, via the TWISTED_REACTOR setting.






The typical workflow, when using the FilesPipeline, goes like this:

In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field.

The item is returned from the spider and goes to the item pipeline.

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finished downloading (or failed for some reason).

When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field), the file checksum and the file status. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won’t be present in the files field.


Using the ImagesPipeline is a lot like using the FilesPipeline, except the default field names used are different: you use image_urls for the image URLs of an item and it will populate an images field for the information about the downloaded images.

The advantage of using the ImagesPipeline for image files is that you can configure some extra functions like generating thumbnails and filtering the images based on their size.

The Images Pipeline requires Pillow 4.0.0 or greater. It is used for thumbnailing and normalizing images to JPEG/RGB format.


FILES_STORE = '/path/to/valid/dir'
IMAGES_STORE = '/path/to/valid/dir'

(2) Specifying filenames. By default, a file is named with the SHA-1 hash of its source URL, plus a .jpg suffix in the case of images. You can instead override the file_path() method in your media pipeline class to name files yourself. For example, for http://www.example.com/product/images/large/front/0000000004166, to get the name 00b08510e4_front.jpg (the prefix being a hash of the URL):

import hashlib

def file_path(self, request, response=None, info=None, *, item=None):
    image_url_hash = hashlib.shake_256(request.url.encode()).hexdigest(5)
    image_perspective = request.url.split('/')[-2]
    image_filename = f'{image_url_hash}_{image_perspective}.jpg'

    return image_filename

The most important rule when naming files yourself is to avoid name clashes, because a file with the same name would overwrite the old one. Overriding file_path() can also be used to save files in their original format, which is especially useful for GIF files; more on that later.
(3) Storage. The supported backends are the local filesystem, FTP servers, Amazon S3 and Google Cloud Storage; only local storage is covered here. An image, for example, is stored under <IMAGES_STORE>/full/, with the SHA-1 hash of its URL as the filename.



New in version 2.0.

FILES_STORE and IMAGES_STORE can point to an FTP server. Scrapy will automatically upload the files to the server.

FILES_STORE and IMAGES_STORE should be written in one of the following forms: ftp://username:password@address:port/path or ftp://address:port/path.

If username and password are not provided, they are taken from the FTP_USER and FTP_PASSWORD settings respectively.

FTP supports two different connection modes: active or passive. Scrapy uses the passive connection mode by default. To use the active connection mode instead, set the FEED_STORAGE_FTP_ACTIVE setting to True.

(1) Enable the media pipeline in settings.py and set the storage directory with FILES_STORE or IMAGES_STORE.
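For the Images pipeline, for example, that looks like this (the directory path is a placeholder):

```python
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/valid/dir'  # placeholder directory
```

The item then also needs image_urls and images fields, as shown next.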

import scrapy

class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()

The image_urls and images field names can also be changed in the settings:

IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'

The last two are for the Files pipeline. If your media pipeline class does not use the default name, prefix these settings with that name. For example, if your Images pipeline class is called MyPipeline, the settings become:
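Following the official docs' naming convention:

```python
MYPIPELINE_IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
MYPIPELINE_IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
```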



# 120 days of delay for files expiration
FILES_EXPIRES = 120

# 30 days of delay for images expiration
IMAGES_EXPIRES = 30

# thumbnail generation
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}








(1) file_path(self, request, response=None, info=None, *, item=None): called once for each file to be downloaded, to produce the storage filename. By default it returns full/<request URL hash>.<extension>; the directory configured in settings.py is prepended to it to form the absolute path. For example, for the image https://example.com/a/b/c/foo.png:

import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        return 'files/' + os.path.basename(urlparse(request.url).path)

(2) get_media_requests(item, info): as mentioned earlier, the pipeline class obtains the URLs to download from the item, and this method generates the Request objects that download the files. Once those requests complete, the results are sent to the item_completed() method described below as a list of 2-element tuples of the form (success, file_info_or_error): success is a bool telling whether the download succeeded, and file_info_or_error is a dict with information about the download (on success) or a Twisted Failure object (on failure). Note also that the order of the tuples in the list received by item_completed() matches the order of the requests issued by get_media_requests(item, info).
(3) item_completed(results, item, info): called once all the download requests of an item have either succeeded or failed. An example override:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    file_paths = [x['path'] for ok, x in results if ok]
    if not file_paths:
        raise DropItem("Item contains no files")
    adapter = ItemAdapter(item)
    adapter['file_paths'] = file_paths
    return item


(1) The file_path(self, request, response=None, info=None, *, item=None) method: same as in FilesPipeline.
(2) get_media_requests(item, info): same as in FilesPipeline except for the field names.
(3) item_completed(results, item, info): same as in FilesPipeline except for the field names.

6.AutoThrottle extension

Its purpose is to adjust the crawl speed automatically according to the load of both Scrapy and the target website's server. The official documentation explains its algorithm and rationale; for plain usage, all you need to know is to set AUTOTHROTTLE_ENABLED = True in settings.py. The settings in detail:

AUTOTHROTTLE_ENABLED¶
Default: False

Enables the AutoThrottle extension.

AUTOTHROTTLE_START_DELAY¶
Default: 5.0

The initial download delay (in seconds).

AUTOTHROTTLE_MAX_DELAY¶
Default: 60.0

The maximum download delay (in seconds) to be set in case of high latencies.

AUTOTHROTTLE_TARGET_CONCURRENCY¶
Default: 1.0

Average number of requests Scrapy should be sending in parallel to remote websites.

By default, AutoThrottle adjusts the delay to send a single concurrent request to each of the remote websites. Set this option to a higher value (e.g. 2.0) to increase the throughput and the load on remote servers. A lower AUTOTHROTTLE_TARGET_CONCURRENCY value (e.g. 0.5) makes the crawler more conservative and polite.

Note that CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP options are still respected when AutoThrottle extension is enabled. This means that if AUTOTHROTTLE_TARGET_CONCURRENCY is set to a value higher than CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP, the crawler won’t reach this number of concurrent requests.

At every given time point Scrapy can be sending more or less concurrent requests than AUTOTHROTTLE_TARGET_CONCURRENCY; it is a suggested value the crawler tries to approach, not a hard limit.

AUTOTHROTTLE_DEBUG¶
Default: False

Enable AutoThrottle debug mode which will display stats on every response received, so you can see how the throttling parameters are being adjusted in real time.
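Putting these together, a possible settings.py fragment (all values shown are the documented defaults, except AUTOTHROTTLE_ENABLED):

```python
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # cap for high latencies (seconds)
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False             # True to log throttling stats per response
```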


Test your hardware's upper limit with the minimal benchmark spider that ships with Scrapy: simply run scrapy bench. The output contains lines like [scrapy.extensions.logstats] INFO: Crawled 518 pages (at 2880 pages/min), scraped 0 items (at 0 items/min), meaning it can crawl roughly 2880 pages per minute. Considering spider complexity, the target server's resource usage and so on, real crawls will never be this fast.


Set JOBDIR in settings.py (or on the command line) to enable this, but note that the directory must not be shared between different spiders, nor between different runs of the same spider. For example, run scrapy crawl somespider -s JOBDIR=crawls/somespider-1 (it is not clear to me why the JOBDIR in the official docs ends with -1); sending Ctrl+C or certain other signals to the cmd window or terminal then pauses the spider safely, and running the same scrapy crawl command later resumes it.

For several spiders in the same project, I believe JOBDIR could be set per spider through the custom_settings class attribute, though I have not tried this. The approach given in the official docs is to use the spider's state attribute:

def parse_item(self, response):
    # parse item here
    self.state['items_count'] = self.state.get('items_count', 0) + 1




This document describes the architecture of Scrapy and how its components interact.

The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data flow that takes place inside the system (shown by the red arrows). A brief description of the components is included below with links for more detailed information about them. The data flow is also described below.

Data flow¶
Scrapy architecture
The data flow in Scrapy is controlled by the execution engine, and goes like this:

The Engine gets the initial Requests to crawl from the Spider.

The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.

The Scheduler returns the next Requests to the Engine.

The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).

Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).

The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).

The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).

The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.

The process repeats (from step 3) until there are no more requests from the Scheduler.

Scrapy Engine¶
The engine is responsible for controlling the data flow between all components of the system, and triggering events when certain actions occur. See the Data Flow section above for more details.

Scheduler¶
The scheduler receives requests from the engine and enqueues them for feeding them later (also to the engine) when the engine requests them.

Downloader¶
The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.

Spiders¶
Spiders are custom classes written by Scrapy users to parse responses and extract items from them or additional requests to follow. For more information see Spiders.

Item Pipeline¶
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database). For more information see Item Pipeline.

Downloader middlewares¶
Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests when they pass from the Engine to the Downloader, and responses that pass from Downloader to the Engine.

Use a Downloader middleware if you need to do one of the following:

process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);

change received response before passing it to a spider;

send a new Request instead of passing received response to a spider;

pass response to a spider without fetching a web page;

silently drop some requests.

For more information see Downloader Middleware.

Spider middlewares¶
Spider middlewares are specific hooks that sit between the Engine and the Spiders and are able to process spider input (responses) and output (items and requests).

Use a Spider middleware if you need to

post-process output of spider callbacks - change/add/remove requests or items;

post-process start_requests;

handle spider exceptions;

call errback instead of callback for some of the requests based on response content.

For more information see Spider Middleware.

Event-driven networking¶
Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it’s implemented using non-blocking (aka asynchronous) code for concurrency.

For more information about asynchronous programming and Twisted see these links:

Introduction to Deferreds

Twisted - hello, asynchronous programming

Twisted Introduction - Krondo


Downloader middlewares are enabled in settings.py. The smaller a middleware's value, the earlier its process_request() method and the later its process_response() method is invoked; you can think of smaller values as being closer to the engine. The value matters because your own downloader middleware sometimes depends on what earlier middlewares have already done. The default order is:

DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
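To enable a downloader middleware of your own, add it to DOWNLOADER_MIDDLEWARES with an order value chosen relative to the table above; the project path and class name here are hypothetical:

```python
DOWNLOADER_MIDDLEWARES = {
    # hypothetical custom middleware, invoked just before RetryMiddleware (550)
    'myproject.middlewares.CustomUserAgentMiddleware': 543,
}
```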



There are also middlewares such as scrapy.downloadermiddlewares.cookies.CookiesMiddleware, which maintains the session and stores cookies. You can configure its behaviour in settings.py through COOKIES_ENABLED and COOKIES_DEBUG (by default the former is on and the latter off).

By default a spider keeps a single cookie session for all its requests. You can change this with the cookiejar key in a request's meta, which lets one spider keep several separate cookie sessions:

for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
        callback=self.parse_page)


def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)

There are also DefaultHeadersMiddleware, DownloadTimeoutMiddleware, HttpAuthMiddleware, HttpCacheMiddleware, HttpCompressionMiddleware, HttpProxyMiddleware, RedirectMiddleware and the other downloader middlewares listed in DOWNLOADER_MIDDLEWARES_BASE, which are not covered here.





The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.

Extensions are just regular classes.

Extension settings¶
Extensions use the Scrapy settings to manage their settings, just like any other Scrapy code.

It is customary for extensions to prefix their settings with their own name, to avoid collision with existing (and future) extensions. For example, a hypothetical extension to handle Google Sitemaps would use settings like GOOGLESITEMAP_ENABLED, GOOGLESITEMAP_DEPTH, and so on.

Loading & activating extensions¶
Extensions are loaded and activated at startup by instantiating a single instance of the extension class per spider being run. All the extension initialization code must be performed in the class __init__ method.

To make an extension available, add it to the EXTENSIONS setting in your Scrapy settings. In EXTENSIONS, each extension is represented by a string: the full Python path to the extension’s class name. For example:

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 500,
    'scrapy.extensions.telnet.TelnetConsole': 500,
}
As you can see, the EXTENSIONS setting is a dict where the keys are the extension paths, and their values are the orders, which define the extension loading order. The EXTENSIONS setting is merged with the EXTENSIONS_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled extensions.

As extensions typically do not depend on each other, their loading order is irrelevant in most cases. This is why the EXTENSIONS_BASE setting defines all extensions with the same order (0). However, this feature can be exploited if you need to add an extension which depends on other extensions already loaded.

Available, enabled and disabled extensions¶
Not all available extensions will be enabled. Some of them usually depend on a particular setting. For example, the HTTP Cache extension is available by default but disabled unless the HTTPCACHE_ENABLED setting is set.

Disabling an extension¶
In order to disable an extension that comes enabled by default (i.e. those included in the EXTENSIONS_BASE setting) you must set its order to None. For example:

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': None,
}
Writing your own extension¶
Each extension is a Python class. The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method which receives a Crawler instance. Through the Crawler object you can access settings, signals, stats, and also control the crawling behaviour.

Typically, extensions connect to signals and perform tasks triggered by them.

Finally, if the from_crawler method raises the NotConfigured exception, the extension will be disabled. Otherwise, the extension will be enabled.

Sample extension¶
Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension will log a message every time:

a spider is opened

a spider is closed

a specific number of items are scraped

The extension will be enabled through the MYEXT_ENABLED setting and the number of items will be specified through the MYEXT_ITEMCOUNT setting.

Here is the code of such extension:

import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class SpiderOpenCloseLogging:

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)
Built-in extensions reference¶
General purpose extensions¶
Log Stats extension¶
Log basic stats like crawled pages and scraped items.

Core Stats extension¶
Enable the collection of core statistics, provided the stats collection is enabled (see Stats Collection).

Telnet console extension¶
Provides a telnet console for getting into a Python interpreter inside the currently running Scrapy process, which can be very useful for debugging.

The telnet console must be enabled by the TELNETCONSOLE_ENABLED setting, and the server will listen in the port specified in TELNETCONSOLE_PORT.

Memory usage extension¶

This extension does not work on Windows.

Monitors the memory used by the Scrapy process that runs the spider and:

sends a notification e-mail when it exceeds a certain value

closes the spider when it exceeds a certain value

The notification e-mails can be triggered when a certain warning value is reached (MEMUSAGE_WARNING_MB) and when the maximum value is reached (MEMUSAGE_LIMIT_MB) which will also cause the spider to be closed and the Scrapy process to be terminated.

This extension is enabled by the MEMUSAGE_ENABLED setting and can be configured with the following settings: MEMUSAGE_LIMIT_MB, MEMUSAGE_WARNING_MB, MEMUSAGE_NOTIFY_MAIL and MEMUSAGE_CHECK_INTERVAL_SECONDS.





Memory debugger extension¶
An extension for debugging memory usage. It collects information about:

objects uncollected by the Python garbage collector

objects left alive that shouldn’t. For more info, see Debugging memory leaks with trackref

To enable this extension, turn on the MEMDEBUG_ENABLED setting. The info will be stored in the stats.

Close spider extension¶
Closes a spider automatically when some conditions are met, using a specific closing reason for each condition.

The conditions for closing a spider can be configured through the following settings: CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ERRORCOUNT.






When a certain closing condition is met, requests which are currently in the downloader queue (up to CONCURRENT_REQUESTS requests) are still processed.

CLOSESPIDER_TIMEOUT¶
Default: 0

An integer which specifies a number of seconds. If the spider remains open for more than that number of seconds, it will be automatically closed with the reason closespider_timeout. If zero (or not set), spiders won’t be closed by timeout.

CLOSESPIDER_ITEMCOUNT¶
Default: 0

An integer which specifies a number of items. If the spider scrapes more than that amount and those items are passed through the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won’t be closed by number of passed items.

CLOSESPIDER_PAGECOUNT¶
Default: 0

An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, it will be closed with the reason closespider_pagecount. If zero (or not set), spiders won’t be closed by number of crawled responses.

CLOSESPIDER_ERRORCOUNT¶
Default: 0

An integer which specifies the maximum number of errors to receive before closing the spider. If the spider generates more than that number of errors, it will be closed with the reason closespider_errorcount. If zero (or not set), spiders won’t be closed by number of errors.
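For example, a settings.py fragment that closes the spider after 10 minutes or after 100 items, whichever comes first (the values here are illustrative):

```python
CLOSESPIDER_TIMEOUT = 600    # seconds; close with reason closespider_timeout
CLOSESPIDER_ITEMCOUNT = 100  # items passed through the pipeline
```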

StatsMailer extension¶
This simple extension can be used to send a notification e-mail every time a domain has finished scraping, including the Scrapy stats collected. The email will be sent to all recipients specified in the STATSMAILER_RCPTS setting.

Emails can be sent using the MailSender class. To see a full list of parameters, including examples on how to instantiate MailSender and use mail settings, see Sending e-mail.

Debugging extensions¶
Stack trace dump extension¶
Dumps information about the running process when a SIGQUIT or SIGUSR2 signal is received. The information dumped is the following:

engine status (using scrapy.utils.engine.get_engine_status())

live references (see Debugging memory leaks with trackref)

stack trace of all threads

After the stack trace and engine status is dumped, the Scrapy process continues running normally.

This extension only works on POSIX-compliant platforms (i.e. not Windows), because the SIGQUIT and SIGUSR2 signals are not available on Windows.

There are at least two ways to send Scrapy the SIGQUIT signal:

By pressing Ctrl-\ while a Scrapy process is running (Linux only?)

By running this command (assuming <pid> is the process id of the Scrapy process):

kill -QUIT <pid>
Debugger extension¶
Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 signal is received. After the debugger is exited, the Scrapy process continues running normally.

For more info see Debugging in Python.

This extension only works on POSIX-compliant platforms (i.e. not Windows).


The remaining topics include handling dynamically generated data (covered only superficially in the official docs), debugging memory leaks, distributed deployment with scrapyd, support for coroutines and asyncio, the APIs of the individual components, the scheduler, and Item Exporters (exporting to XML, CSV, JSON, JSON lines and other formats); they are not covered here.

Tags: python, scrapy