Python, Web Scraping and Deep Learning (14): Scrapy in Detail (Part 2)
This is a long, systematic summary that can be used as a reference manual.
Contents:
1.Debugging Spiders
2.Common Practices
3.Broad Crawls
4.Using the browser's developer tools
5.Downloading images and files
6.AutoThrottle extension
7.Benchmarking
8.Pausing and resuming crawls
9.Scrapy architecture
10.Downloader middlewares
11.Spider middlewares
12.Adding extensions
13.Miscellaneous
This article was first published on my personal blog at https://lisper517.top/index.php/archives/52/ ; please credit the source when reposting.
The aim of this article is to cover Scrapy in detail; most of it is translated from the official Scrapy documentation.
It was written on September 15, 2022, on Windows 10 with VS Code as the editor.
The previous article introduced the basic concepts of Scrapy; this one covers more advanced usage and techniques.
1.Debugging Spiders
This section describes several ways to debug spiders, mainly:
(1) The scrapy parse command.
(2) The inspect_response() helper from scrapy.shell, for example:
from scrapy.shell import inspect_response

def parse_details(self, response, item=None):
    if item:
        # populate more `item` fields
        return item
    else:
        inspect_response(response, self)
(3) The open_in_browser() helper from scrapy.utils.response, which opens the response in your browser:
from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    if "item name" not in response.text:
        open_in_browser(response)
(4) Logging.
(5) Spider Contracts, sketched briefly below.
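As a quick illustration of (5), contracts live in a callback's docstring and are run with the scrapy check command. This sketch follows the shape of the official example; the URL and field names are placeholders:
def parse(self, response):
    """This callback's docstring carries the contracts.

    @url http://www.example.com/some-sample-listing
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """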
2.Common Practices
This section covers how to run spiders in practice. Topics include:
(1) Running spiders by means other than the scrapy crawl command, for example running several spiders (or all of them) from one script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiderloader import SpiderLoader

# build a CrawlerProcess from the project settings
process = CrawlerProcess(get_project_settings())
# a SpiderLoader can list every spider name in the project
spider_loader = SpiderLoader(get_project_settings())
# add all spiders
for spidername in spider_loader.list():
    process.crawl(spidername)
# or add a single spider
# process.crawl('spider_name')
# run
process.start()
(2) Distributed crawling, which involves scrapyd.
(3) Avoiding IP bans; the original documentation suggests (a settings sketch follows the list):
rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
use download delays (2 or higher). See DOWNLOAD_DELAY setting.
if possible, use Google cache to fetch pages, instead of hitting the sites directly
use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Zyte Smart Proxy Manager
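Of these suggestions, the two that map directly onto built-in settings can go straight into settings.py; user-agent rotation and proxy pools need a downloader middleware (see section 10) or an external service:
COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2        # wait at least 2 seconds between requests to a site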
3.Broad Crawls
Everything so far has been about focused crawls, i.e. crawls aimed at one particular site. Search engines such as Google or Baidu run broad crawls instead; this part of the docs explains how to tune Scrapy for broad crawling, and the original text is reproduced below for those interested.
Scrapy defaults are optimized for crawling specific sites. These sites are often handled by a single Scrapy spider, although this is not necessary or required (for example, there are generic spiders that handle any given site thrown at them).
In addition to this “focused crawl”, there is another common type of crawling which covers a large (potentially unlimited) number of domains, and is only limited by time or other arbitrary constraint, rather than stopping when the domain was crawled to completion or when there are no more requests to perform. These are called “broad crawls” and are the typical crawls employed by search engines.
These are some common properties often found in broad crawls:
they crawl many domains (often, unbounded) instead of a specific set of sites
they don’t necessarily crawl domains to completion, because it would be impractical (or impossible) to do so, and instead limit the crawl by time or number of pages crawled
they are simpler in logic (as opposed to very complex spiders with many extraction rules) because data is often post-processed in a separate stage
they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited by any particular site constraint (each site is crawled slowly to respect politeness, but many sites are crawled in parallel)
As said above, Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.
Use the right SCHEDULER_PRIORITY_QUEUE
Scrapy’s default scheduler priority queue is 'scrapy.pqueues.ScrapyPriorityQueue'. It works best during single-domain crawls. It does not work well with crawling many different domains in parallel.
To apply the recommended priority queue use:
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
Increase concurrency
Concurrency is the number of requests that are processed in parallel. There is a global limit (CONCURRENT_REQUESTS) and an additional limit that can be set either per domain (CONCURRENT_REQUESTS_PER_DOMAIN) or per IP (CONCURRENT_REQUESTS_PER_IP).
Note
The scheduler priority queue recommended for broad crawls does not support CONCURRENT_REQUESTS_PER_IP.
The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU and memory your crawler will have available.
A good starting point is 100:
CONCURRENT_REQUESTS = 100
But the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU bounded. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%.
Increasing concurrency also increases memory usage. If memory usage is a concern, you might need to lower your global concurrency limit accordingly.
Increase Twisted IO thread pool maximum size
Currently Scrapy does DNS resolution in a blocking way, using a thread pool. With higher concurrency levels the crawling could be slow or even fail, hitting DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries: the DNS queue will be processed faster, speeding up connection establishment and crawling overall.
To increase maximum thread pool size use:
REACTOR_THREADPOOL_MAXSIZE = 20
Setup your own DNS
If you have multiple crawling processes and a single central DNS server, it can act like a DoS attack on the DNS server, resulting in a slowdown of the entire network or even blocking your machines. To avoid this, set up your own DNS server with a local cache and an upstream to some large DNS provider like OpenDNS or Verizon.
Reduce log level
When doing broad crawls you are often only interested in the crawl rates you get and any errors found. These stats are reported by Scrapy when using the INFO log level. In order to save CPU (and log storage requirements) you should not use the DEBUG log level when performing large broad crawls in production. Using the DEBUG level while developing your (broad) crawler may be fine though.
To set the log level use:
LOG_LEVEL = 'INFO'
Disable cookies
Disable cookies unless you really need them. Cookies are often not needed when doing broad crawls (search engine crawlers ignore them), and disabling them improves performance by saving some CPU cycles and reducing the memory footprint of your Scrapy crawler.
To disable cookies use:
COOKIES_ENABLED = False
Disable retries
Retrying failed HTTP requests can slow down the crawls substantially, especially when sites are very slow (or fail) to respond, causing timeout errors that get retried many times, unnecessarily, and preventing crawler capacity from being reused for other domains.
To disable retries use:
RETRY_ENABLED = False
Reduce download timeout
Unless you are crawling from a very slow connection (which shouldn’t be the case for broad crawls) reduce the download timeout so that stuck requests are discarded quickly and free up capacity to process the next ones.
To reduce the download timeout use:
DOWNLOAD_TIMEOUT = 15
Disable redirects
Consider disabling redirects, unless you are interested in following them. When doing broad crawls it’s common to save redirects and resolve them when revisiting the site at a later crawl. This also helps to keep the number of requests constant per crawl batch, otherwise redirect loops may cause the crawler to dedicate too many resources to any specific domain.
To disable redirects use:
REDIRECT_ENABLED = False
Enable crawling of “Ajax Crawlable Pages”
Some pages (up to 1%, based on empirical data from year 2013) declare themselves as ajax crawlable. This means they provide plain HTML version of content that is usually available only via AJAX. Pages can indicate it in two ways:
by using #! in URL - this is the default way;
by using a special meta tag - this way is used on “main”, “index” website pages.
Scrapy handles (1) automatically; to handle (2) enable AjaxCrawlMiddleware:
AJAXCRAWL_ENABLED = True
When doing broad crawls it’s common to crawl a lot of “index” web pages; AjaxCrawlMiddleware helps to crawl them correctly. It is turned OFF by default because it has some performance overhead, and enabling it for focused crawls doesn’t make much sense.
Crawl in BFO order
Scrapy crawls in DFO order by default.
In broad crawls, however, page crawling tends to be faster than page processing. As a result, unprocessed early requests stay in memory until the final depth is reached, which can significantly increase memory usage.
Crawl in BFO order instead to save memory.
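The settings the official documentation gives for switching to BFO order are:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'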
Be mindful of memory leaks
If your broad crawl shows a high memory usage, in addition to crawling in BFO order and lowering concurrency you should debug your memory leaks.
Install a specific Twisted reactor
If the crawl is exceeding the system’s capabilities, you might want to try installing a specific Twisted reactor, via the TWISTED_REACTOR setting.
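Pulling this section's suggestions together, a broad-crawl settings.py might start from something like this (the numbers are the starting points suggested above, not hard rules):
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
CONCURRENT_REQUESTS = 100
REACTOR_THREADPOOL_MAXSIZE = 20
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
REDIRECT_ENABLED = False
AJAXCRAWL_ENABLED = True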
4.Using the browser's developer tools
Never include the tbody tag in your XPath expressions. Browsers modify the page source (pretty-printing it and executing JavaScript) while Scrapy does not, so the source you see in the browser can differ from what Scrapy receives; one well-known case is that Firefox likes to insert tbody tags into tables. I will not cover the rest of the developer tools here; there are far more detailed tutorials elsewhere, preferably ones with screenshots or videos.
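For example, for a table the browser shows as <table><tbody><tr>..., write the XPath against the raw HTML Scrapy actually receives (the selectors here are only illustrative):
rows = response.xpath('//table//tr')          # matches the raw HTML Scrapy downloads
# rows = response.xpath('//table/tbody/tr')   # may match nothing outside the browser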
5.Downloading images and files
Use the ImagesPipeline and FilesPipeline. Their shared features: avoiding re-downloading files (Scrapy hashes each file for deduplication) and choosing where downloads are stored (a local directory, an FTP server, Google Cloud Storage). The ImagesPipeline additionally converts images to JPEG/RGB (which makes downloading GIFs awkward), generates thumbnails, and can filter images by width and height.
The official documentation describes the FilesPipeline workflow as follows:
The typical workflow, when using the FilesPipeline goes like this:
In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field.
The item is returned from the spider and goes to the item pipeline.
When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finished downloading (or fail for some reason).
When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field), the file checksum and the file status. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won’t be present in the files field.
And about the ImagesPipeline:
Using the ImagesPipeline is a lot like using the FilesPipeline, except the default field names used are different: you use image_urls for the image URLs of an item and it will populate an images field for the information about the downloaded images.
The advantage of using the ImagesPipeline for image files is that you can configure some extra functions like generating thumbnails and filtering the images based on their size.
The Images Pipeline requires Pillow 4.0.0 or greater. It is used for thumbnailing and normalizing images to JPEG/RGB format.
The workflow for using the media pipelines is:
(1) In settings.py, add the media pipeline to ITEM_PIPELINES, then set the storage location (without one, nothing is downloaded), for example:
FILES_STORE = '/path/to/valid/dir'
IMAGES_STORE = '/path/to/valid/dir'
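Enabling the pipelines themselves also happens in settings.py; these are the built-in class paths, and you only need the one(s) you use:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    # 'scrapy.pipelines.files.FilesPipeline': 1,
}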
(2) Choosing file names. By default a file is named after the SHA-1 hash of its source URL, with a .jpg extension added for images. You can instead override the file_path() method of your media pipeline class; for example, to save http://www.example.com/product/images/large/front/0000000004166 as 00b08510e4_front.jpg (a hash is still needed):
import hashlib

def file_path(self, request, response=None, info=None, *, item=None):
    image_url_hash = hashlib.shake_256(request.url.encode()).hexdigest(5)
    image_perspective = request.url.split('/')[-2]
    image_filename = f'{image_url_hash}_{image_perspective}.jpg'
    return image_filename
The most important rule when choosing names yourself is to avoid collisions, because an existing file with the same name is overwritten. Overriding file_path() can also keep files in their original format, which matters most when saving GIFs; a sketch follows.
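A minimal sketch of that idea: send the images through the FilesPipeline, which does not re-encode anything to JPEG, and control the name and extension explicitly in file_path(). The class name is made up, and the naming simply mirrors the default hash-plus-extension scheme:
import hashlib
from os.path import splitext
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline

class KeepFormatFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        url_hash = hashlib.sha1(request.url.encode()).hexdigest()
        # keep .gif/.png/... instead of forcing .jpg
        ext = splitext(urlparse(request.url).path)[1] or '.bin'
        return f'full/{url_hash}{ext}'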
(3) Storage. Supported backends are the local filesystem, FTP servers, Amazon S3 and Google Cloud Storage; only local storage is covered here. An image, for example, ends up at:
<IMAGES_STORE>/full/<FILE_NAME>
The full subdirectory exists to keep full-size images separate from their thumbnails.
On FTP storage, the original documentation says:
New in version 2.0.
FILES_STORE and IMAGES_STORE can point to an FTP server. Scrapy will automatically upload the files to the server.
FILES_STORE and IMAGES_STORE should be written in one of the following forms:
ftp://username:password@address:port/path
ftp://address:port/path
If username and password are not provided, they are taken from the FTP_USER and FTP_PASSWORD settings respectively.
FTP supports two different connection modes: active or passive. Scrapy uses the passive connection mode by default. To use the active connection mode instead, set the FEED_STORAGE_FTP_ACTIVE setting to True.
The changes to make in your project files are:
(1) Enable the media pipelines in settings.py and set the storage locations with FILES_STORE and IMAGES_STORE.
(2) Add dedicated media fields in items.py, for example:
import scrapy

class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()
The image_urls and images field names can also be changed in the settings:
IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'
The last two belong to the FilesPipeline. If your media pipeline is not the default class, prefix these setting names with your class name in upper case; for example, if your images pipeline class is called MyPipeline, use:
MYPIPELINE_IMAGES_URLS_FIELD =
MYPIPELINE_IMAGES_RESULT_FIELD =
MYPIPELINE_FILES_URLS_FIELD =
MYPIPELINE_FILES_RESULT_FIELD =
Finally, a few remaining features:
(1) Scrapy will not re-download a file or image fetched within the last 90 days; the expiration period is adjustable in settings.py:
# 120 days of delay for files expiration
FILES_EXPIRES = 120
MYPIPELINE_FILES_EXPIRES = 120
# 30 days of delay for images expiration
IMAGES_EXPIRES = 30
MYPIPELINE_IMAGES_EXPIRES = 30
Note that the unit is days. Scrapy compares a file's last-modified time with the current time to decide whether it is still within this window.
(2) Generating thumbnails, for example:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
The keys are thumbnail names and the values their (width, height). With the settings above, each image is stored as three files:
<IMAGES_STORE>/full/<image_name>.jpg
<IMAGES_STORE>/thumbs/small/<image_name>.jpg
<IMAGES_STORE>/thumbs/big/<image_name>.jpg
(3) Filtering out small images:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
If both limits are set, an image must satisfy both to be downloaded. By default there are no size limits.
(4) Allowing redirects. By default, a media pipeline request that gets redirected is treated as a failed download. With:
MEDIA_ALLOW_REDIRECTS = True
the media pipelines will follow redirects instead.
You can also override the media pipeline methods. For the FilesPipeline they are:
(1) file_path(self, request, response=None, info=None, *, item=None), called for each file to be downloaded to produce its storage path. By default it returns full/<request URL hash>.<extension>, relative to the directory configured in settings.py. For example, for the image https://example.com/a/b/c/foo.png:
import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        return 'files/' + os.path.basename(urlparse(request.url).path)
(2) get_media_requests(item, info): as mentioned earlier, the pipeline takes the URLs to download from the item and uses this method to generate the Request objects. Once those requests finish, the results are passed to the item_completed() method described below as a list of 2-element tuples of the form (success, file_info_or_error), where success is a bool indicating whether the download succeeded and file_info_or_error is a dict with information about the downloaded file (on success) or a Twisted Failure (on failure). The order of the tuples matches the order of the requests returned by get_media_requests(item, info). An override is sketched below.
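For reference, an override that simply issues one request per URL, which is essentially what the default implementation does, could look like this:
import scrapy
from itemadapter import ItemAdapter

def get_media_requests(self, item, info):
    adapter = ItemAdapter(item)
    for file_url in adapter['file_urls']:
        yield scrapy.Request(file_url)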
(3) item_completed(results, item, info), called once every download request for an item has either succeeded or failed. An example override:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    file_paths = [x['path'] for ok, x in results if ok]
    if not file_paths:
        raise DropItem("Item contains no files")
    adapter = ItemAdapter(item)
    adapter['file_paths'] = file_paths
    return item
Note that this method must return the item (or raise DropItem), because whatever it returns is passed on to the following item pipeline stages.
The ImagesPipeline is an extension of the FilesPipeline: it changes the field names and adds a few image-specific features. Its overridable methods are:
(1) file_path(self, request, response=None, info=None, *, item=None), same as in FilesPipeline.
(2) get_media_requests(item, info), identical apart from the field names.
(3) item_completed(results, item, info), identical apart from the field names.
6.AutoThrottle extension
Its purpose is to adjust the crawl speed automatically based on the load on both Scrapy and the target server. The official documentation explains the throttling algorithm; for everyday use it is enough to set AUTOTHROTTLE_ENABLED = True in settings.py. The detailed settings are:
AUTOTHROTTLE_ENABLED
Default: False
Enables the AutoThrottle extension.
AUTOTHROTTLE_START_DELAY
Default: 5.0
The initial download delay (in seconds).
AUTOTHROTTLE_MAX_DELAY
Default: 60.0
The maximum download delay (in seconds) to be set in case of high latencies.
AUTOTHROTTLE_TARGET_CONCURRENCY
Default: 1.0
Average number of requests Scrapy should be sending in parallel to remote websites.
By default, AutoThrottle adjusts the delay to send a single concurrent request to each of the remote websites. Set this option to a higher value (e.g. 2.0) to increase the throughput and the load on remote servers. A lower AUTOTHROTTLE_TARGET_CONCURRENCY value (e.g. 0.5) makes the crawler more conservative and polite.
Note that CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP options are still respected when AutoThrottle extension is enabled. This means that if AUTOTHROTTLE_TARGET_CONCURRENCY is set to a value higher than CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP, the crawler won’t reach this number of concurrent requests.
At every given time point Scrapy can be sending more or less concurrent requests than AUTOTHROTTLE_TARGET_CONCURRENCY; it is a suggested value the crawler tries to approach, not a hard limit.
AUTOTHROTTLE_DEBUG
Default: False
Enable AutoThrottle debug mode which will display stats on every response received, so you can see how the throttling parameters are being adjusted in real time.
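Put together, a typically polite configuration in settings.py looks like this (the values are simply the defaults listed above; raise AUTOTHROTTLE_TARGET_CONCURRENCY only if the target servers can take the load):
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = False   # set to True to watch the throttling adjust in real time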
7.Benchmarking
Benchmarking runs a minimal spider bundled with Scrapy to probe the upper limit of your hardware. Simply run scrapy bench; the output will contain lines such as [scrapy.extensions.logstats] INFO: Crawled 518 pages (at 2880 pages/min), scraped 0 items (at 0 items/min), i.e. roughly 2880 pages per minute. Real crawls are far slower, given the complexity of actual spiders and the load the target servers can handle.
8.Pausing and resuming crawls
Because the scheduler can persist its queue on disk and the dupefilter can persist the set of seen requests, plus a few extensions that persist spider state, Scrapy can pause and resume a crawl (even across a shutdown), which matters most when crawling large sites.
You enable this by setting JOBDIR, keeping in mind that the directory must not be shared between different spiders, or between different runs of the same spider. For example, run scrapy crawl somespider -s JOBDIR=crawls/somespider-1 (the trailing -1 in the official docs presumably just labels one particular run, since runs cannot share the directory), then press Ctrl+C or send another shutdown signal to the terminal to pause the spider safely; running the same scrapy crawl command again resumes it.
For multiple spiders in one project, I believe JOBDIR could be set per spider via the custom_settings class attribute, though I have not tried it (a sketch follows the state example below). The mechanism the docs give for persisting your own data between runs is the spider's state attribute:
def parse_item(self, response):
    # parse item here
    self.state['items_count'] = self.state.get('items_count', 0) + 1
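And a minimal sketch of the untested custom_settings idea mentioned above (the spider name and directory are placeholders):
import scrapy

class SomeSpider(scrapy.Spider):
    name = 'somespider'
    # per-spider settings: each spider gets its own, unshared job directory
    custom_settings = {
        'JOBDIR': 'crawls/somespider-1',
    }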
Two final reminders: (1) if your spider relies on cookies, resume soon after pausing, because cookies expire; another reason to resume quickly is that pages may change, leaving previously scraped data stale or invalid (for example stock-market figures that a page updates daily and only shows for the current day). (2) Request objects are persisted with the pickle module, so they must be picklable.
9.Scrapy architecture
Frankly this topic is awkward to place: it is too dry for a first lesson, and experienced users do not need it. The original text is reproduced below; any Scrapy architecture diagram will do for following along:
This document describes the architecture of Scrapy and how its components interact.
Overview
The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data flow that takes place inside the system (shown by the red arrows). A brief description of the components is included below with links for more detailed information about them. The data flow is also described below.
Data flow
[Scrapy architecture diagram omitted; see the official documentation]
The data flow in Scrapy is controlled by the execution engine, and goes like this:
The Engine gets the initial Requests to crawl from the Spider.
The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
The Scheduler returns the next Requests to the Engine.
The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
The process repeats (from step 3) until there are no more requests from the Scheduler.
Components
Scrapy Engine
The engine is responsible for controlling the data flow between all components of the system, and triggering events when certain actions occur. See the Data Flow section above for more details.
Scheduler
The scheduler receives requests from the engine and enqueues them for feeding them later (also to the engine) when the engine requests them.
Downloader
The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.
Spiders
Spiders are custom classes written by Scrapy users to parse responses and extract items from them or additional requests to follow. For more information see Spiders.
Item Pipeline
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database). For more information see Item Pipeline.
Downloader middlewares
Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests when they pass from the Engine to the Downloader, and responses that pass from Downloader to the Engine.
Use a Downloader middleware if you need to do one of the following:
process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
change received response before passing it to a spider;
send a new Request instead of passing received response to a spider;
pass response to a spider without fetching a web page;
silently drop some requests.
For more information see Downloader Middleware.
Spider middlewares
Spider middlewares are specific hooks that sit between the Engine and the Spiders and are able to process spider input (responses) and output (items and requests).
Use a Spider middleware if you need to
post-process output of spider callbacks - change/add/remove requests or items;
post-process start_requests;
handle spider exceptions;
call errback instead of callback for some of the requests based on response content.
For more information see Spider Middleware.
Event-driven networking
Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it’s implemented using non-blocking (aka asynchronous) code for concurrency.
For more information about asynchronous programming and Twisted see these links:
Introduction to Deferreds
Twisted - hello, asynchronous programming
Twisted Introduction - Krondo
10.Downloader middlewares
Downloader middlewares are enabled in settings.py. The smaller a middleware's value, the closer it sits to the engine: its process_request() runs earlier and its process_response() runs later. The ordering matters because your own middleware may rely on work already done by earlier ones. The defaults are:
DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
Setting a value to None disables that middleware.
To write your own downloader middleware, the comments in the generated middlewares.py are a good starting point; a small sketch follows.
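A minimal sketch of a hypothetical RandomUserAgentMiddleware that picks a User-Agent per request; the UA strings are truncated placeholders, and you would enable it with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 400}, where the module path is also made up:
import random

class RandomUserAgentMiddleware:

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (X11; Linux x86_64) ...',
    ]

    def process_request(self, request, spider):
        # returning None lets the request continue through the remaining middlewares
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None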
Among the built-in downloader middlewares, scrapy.downloadermiddlewares.cookies.CookiesMiddleware keeps sessions and stores cookies. Its behaviour is controlled in settings.py by COOKIES_ENABLED and COOKIES_DEBUG (the former defaults to True, the latter to False).
By default a spider uses a single cookie jar (session) for all of its requests; the cookiejar key in a Request's meta lets one spider keep several independent cookie sessions:
for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)
The cookiejar key is not sticky, though: every Request you create in later parse callbacks has to pass it along explicitly:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
There are also DefaultHeadersMiddleware, DownloadTimeoutMiddleware, HttpAuthMiddleware, HttpCacheMiddleware, HttpCompressionMiddleware, HttpProxyMiddleware, RedirectMiddleware and the other downloader middlewares listed in DOWNLOADER_MIDDLEWARES_BASE above; they are not covered here.
11.Spider middlewares
See the middlewares.py file generated in your project for a template; a small sketch follows.
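As a minimal sketch, a hypothetical spider middleware that drops items missing a 'title' field in process_spider_output; register it under SPIDER_MIDDLEWARES in settings.py, and note that the class and field names are made up (the .get() call assumes dict-like items):
import scrapy

class RequireTitleMiddleware:

    def process_spider_output(self, response, result, spider):
        # `result` is the iterable of Requests and items returned by the spider callback
        for obj in result:
            if isinstance(obj, scrapy.Request) or obj.get('title'):
                yield obj
            else:
                spider.logger.debug('Dropping item without title: %r', obj)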
12.Adding extensions
The original documentation follows:
The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.
Extensions are just regular classes.
Extension settings
Extensions use the Scrapy settings to manage their settings, just like any other Scrapy code.
It is customary for extensions to prefix their settings with their own name, to avoid collision with existing (and future) extensions. For example, a hypothetic extension to handle Google Sitemaps would use settings like GOOGLESITEMAP_ENABLED, GOOGLESITEMAP_DEPTH, and so on.
Loading & activating extensions
Extensions are loaded and activated at startup by instantiating a single instance of the extension class per spider being run. All the extension initialization code must be performed in the class __init__ method.
To make an extension available, add it to the EXTENSIONS setting in your Scrapy settings. In EXTENSIONS, each extension is represented by a string: the full Python path to the extension’s class name. For example:
EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 500,
    'scrapy.extensions.telnet.TelnetConsole': 500,
}
As you can see, the EXTENSIONS setting is a dict where the keys are the extension paths, and their values are the orders, which define the extension loading order. The EXTENSIONS setting is merged with the EXTENSIONS_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled extensions.
As extensions typically do not depend on each other, their loading order is irrelevant in most cases. This is why the EXTENSIONS_BASE setting defines all extensions with the same order (0). However, this feature can be exploited if you need to add an extension which depends on other extensions already loaded.
Available, enabled and disabled extensions
Not all available extensions will be enabled. Some of them usually depend on a particular setting. For example, the HTTP Cache extension is available by default but disabled unless the HTTPCACHE_ENABLED setting is set.
Disabling an extension
In order to disable an extension that comes enabled by default (i.e. those included in the EXTENSIONS_BASE setting) you must set its order to None. For example:
EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': None,
}
Writing your own extension
Each extension is a Python class. The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method which receives a Crawler instance. Through the Crawler object you can access settings, signals, stats, and also control the crawling behaviour.
Typically, extensions connect to signals and perform tasks triggered by them.
Finally, if the from_crawler method raises the NotConfigured exception, the extension will be disabled. Otherwise, the extension will be enabled.
Sample extension
Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension will log a message every time:
a spider is opened
a spider is closed
a specific number of items are scraped
The extension will be enabled through the MYEXT_ENABLED setting and the number of items will be specified through the MYEXT_ITEMCOUNT setting.
Here is the code of such extension:
import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class SpiderOpenCloseLogging:

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)
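To actually turn the sample extension on, register it and its settings in settings.py roughly like this (the module path is made up):
MYEXT_ENABLED = True
MYEXT_ITEMCOUNT = 500
EXTENSIONS = {
    'myproject.extensions.SpiderOpenCloseLogging': 500,
}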
Built-in extensions reference
General purpose extensions
Log Stats extension
class scrapy.extensions.logstats.LogStats
Log basic stats like crawled pages and scraped items.
Core Stats extension
class scrapy.extensions.corestats.CoreStats
Enable the collection of core statistics, provided the stats collection is enabled (see Stats Collection).
Telnet console extension
class scrapy.extensions.telnet.TelnetConsole
Provides a telnet console for getting into a Python interpreter inside the currently running Scrapy process, which can be very useful for debugging.
The telnet console must be enabled by the TELNETCONSOLE_ENABLED setting, and the server will listen in the port specified in TELNETCONSOLE_PORT.
Memory usage extension
class scrapy.extensions.memusage.MemoryUsage
Note
This extension does not work in Windows.
Monitors the memory used by the Scrapy process that runs the spider and:
sends a notification e-mail when it exceeds a certain value
closes the spider when it exceeds a certain value
The notification e-mails can be triggered when a certain warning value is reached (MEMUSAGE_WARNING_MB) and when the maximum value is reached (MEMUSAGE_LIMIT_MB) which will also cause the spider to be closed and the Scrapy process to be terminated.
This extension is enabled by the MEMUSAGE_ENABLED setting and can be configured with the following settings:
MEMUSAGE_LIMIT_MB
MEMUSAGE_WARNING_MB
MEMUSAGE_NOTIFY_MAIL
MEMUSAGE_CHECK_INTERVAL_SECONDS
Memory debugger extension
class scrapy.extensions.memdebug.MemoryDebugger
An extension for debugging memory usage. It collects information about:
objects uncollected by the Python garbage collector
objects left alive that shouldn’t. For more info, see Debugging memory leaks with trackref
To enable this extension, turn on the MEMDEBUG_ENABLED setting. The info will be stored in the stats.
Close spider extension
class scrapy.extensions.closespider.CloseSpider
Closes a spider automatically when some conditions are met, using a specific closing reason for each condition.
The conditions for closing a spider can be configured through the following settings:
CLOSESPIDER_TIMEOUT
CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_ERRORCOUNT
Note
When a certain closing condition is met, requests which are currently in the downloader queue (up to CONCURRENT_REQUESTS requests) are still processed.
CLOSESPIDER_TIMEOUT
Default: 0
An integer which specifies a number of seconds. If the spider remains open for more than that number of seconds, it will be automatically closed with the reason closespider_timeout. If zero (or not set), spiders won't be closed by timeout.
CLOSESPIDER_ITEMCOUNT
Default: 0
An integer which specifies a number of items. If the spider scrapes more than that amount and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won't be closed by number of passed items.
CLOSESPIDER_PAGECOUNT
Default: 0
An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, the spider will be closed with the reason closespider_pagecount. If zero (or not set), spiders won't be closed by number of crawled responses.
CLOSESPIDER_ERRORCOUNT
Default: 0
An integer which specifies the maximum number of errors to receive before closing the spider. If the spider generates more than that number of errors, it will be closed with the reason closespider_errorcount. If zero (or not set), spiders won't be closed by number of errors.
StatsMailer extension
class scrapy.extensions.statsmailer.StatsMailer
This simple extension can be used to send a notification e-mail every time a domain has finished scraping, including the Scrapy stats collected. The email will be sent to all recipients specified in the STATSMAILER_RCPTS setting.
Emails can be sent using the MailSender class. To see a full list of parameters, including examples on how to instantiate MailSender and use mail settings, see Sending e-mail.
Debugging extensions
Stack trace dump extension
class scrapy.extensions.debug.StackTraceDump
Dumps information about the running process when a SIGQUIT or SIGUSR2 signal is received. The information dumped is the following:
engine status (using scrapy.utils.engine.get_engine_status())
live references (see Debugging memory leaks with trackref)
stack trace of all threads
After the stack trace and engine status is dumped, the Scrapy process continues running normally.
This extension only works on POSIX-compliant platforms (i.e. not Windows), because the SIGQUIT and SIGUSR2 signals are not available on Windows.
There are at least two ways to send Scrapy the SIGQUIT signal:
By pressing Ctrl-\ while a Scrapy process is running (Linux only?)
By running this command (assuming <pid> is the process id of the Scrapy process):
kill -QUIT <pid>
Debugger extension
class scrapy.extensions.debug.Debugger
Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 signal is received. After the debugger is exited, the Scrapy process continues running normally.
For more info see Debugging in Python.
This extension only works on POSIX-compliant platforms (i.e. not Windows).
13.Miscellaneous
Remaining topics include handling dynamically generated content (the official docs only scratch the surface), debugging memory leaks, deploying spiders with scrapyd, support for coroutines and asyncio (asynchronous I/O), the APIs of the individual components, the scheduler, and Item Exporters (export to XML, CSV, JSON, JSON Lines and other formats); they are not covered here.