This article is long and fairly systematic; it can be used as a reference manual.

Steps / contents:
1. Scrapy introduction and installation
2. Scrapy usage example 1
3. The Scrapy command-line tool
4. Spider classes
    (1) CrawlSpider
    (2) XMLFeedSpider and CSVFeedSpider
    (3) SitemapSpider
5. Selector objects
6. Items
7. Item Loaders
8. scrapy shell
9. Item Pipeline
10. Feed exports
11. Requests and responses
12. Link extractors
13. Settings
14. Exceptions
15. Logging
16. Stats Collection
17. Sending e-mail
18. Telnet Console
19. FAQ

This article was first published on my personal blog at https://lisper517.top/index.php/archives/51/; please credit the source when reposting.
Its purpose is to introduce Scrapy in detail, and it is largely translated from the official Scrapy documentation.
It was written on 14 September 2022, on Windows 10, with VS Code as the editor.

1. Scrapy introduction and installation

Scrapy (/ˈskreɪpaɪ/) is a fast, general-purpose web crawling framework that can be used to extract page data, automate website testing, archive web pages, and more.
Scrapy is written entirely in Python (though some of the libraries it depends on may use other languages such as C), so you need to install and be familiar with Python before installing and using Scrapy. Scrapy requires Python 3.6 or later, and depends on the following libraries:

lxml, an efficient XML and HTML parser
parsel, an HTML/XML data extraction library written on top of lxml,
w3lib, a multi-purpose helper for dealing with URLs and web page encodings
twisted, an asynchronous networking framework
cryptography and pyOpenSSL, to deal with various network-level security needs

Some of these dependencies have minimum version requirements:

Twisted 14.0
lxml 3.4
pyOpenSSL 0.14

Run pip install Scrapy to install Scrapy and its dependencies. If you use a package manager such as Anaconda, install it with conda install -c conda-forge scrapy. The official documentation recommends installing Scrapy inside a Python virtual environment to avoid possible package conflicts; see the Python documentation for details.

2. Scrapy usage example 1

With Scrapy installed, let's crawl a website: https://quotes.toscrape.com , a site built for practicing web scraping. Create a directory for your Scrapy spiders, for example D:\spiders\scrapy on Windows or ~/spider/scrapy on Linux, then open cmd or a terminal in that directory and run scrapy startproject tutorial to create a Scrapy project. Here tutorial is the project name; replace it as you like. Then create a spider file, for example quotes_spider.py, under the tutorial/tutorial/spiders directory with the following content:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

In this spider class, name is the spider's name; spider names must be unique within a project. The start_requests() method issues the initial requests; it overrides the parent method and must return an iterable of Request objects. The parse() method also overrides the parent method; it is the default callback for parsing pages and receives a Response object. The response here is a TextResponse instance and carries some helpful attributes and methods for parsing the page.
parse() and any other parse_xxx() methods you write mainly parse the response, extract data (as dicts), find further URLs to crawl on the page, and create new Request objects for them.

Next, start the spider. Open cmd or a terminal in the project directory and run scrapy crawl quotes. Note that this takes the spider name, not the spider file name. Afterwards, two HTML files appear in the directory you ran the command from; they are the result of this crawl.

Alternatively, you can omit start_requests() and instead define the spider's start_urls attribute (a list of starting URL strings). The URLs in this list are picked up automatically by the parent class's start_requests(), so you don't have to spell it out:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

The Scrapy command line also lets you debug responses: run scrapy shell "url" to get a response object, then use methods such as response.css('title'), response.css('title::text').re(r'Quotes.*') or response.xpath('//title') to look up HTML tags. However, the browser's built-in developer tools (press F12) can do much of this more conveniently, so it is not covered here.

Finally, a word on Selector objects. Both response.css() and response.xpath() return selector objects. Under the hood Scrapy converts CSS selectors into XPath expressions before querying tags, and XPath is more powerful than CSS selectors, so the official documentation leans towards XPath. XPath itself is not covered here; if you need it, see my earlier article https://lisper517.top/index.php/archives/41/ or the runoob tutorial.
Selector objects provide get() and getall() to return their content: the former returns only the first match, the latter returns a list of all matches. These are the newer methods; the corresponding older ones are extract_first() and extract() (which are now effectively aliases of get and getall), but the new methods are preferred. The differences are small: get() always returns a single result (or None), while getall() always returns a list, so the new methods behave more predictably.

Let's modify the spider from above:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

The Python yield keyword plays a role similar to return, but it turns the method into a generator whose results Scrapy consumes lazily, which saves resources; it is used constantly in Scrapy spiders. Run scrapy crawl quotes again and you will see several dicts printed; this is also the format in which Scrapy returns data.

Printing dicts to the command line is not enough; more often you want to persist the data, for example to a file. You can specify an output file when running Scrapy, e.g. scrapy crawl quotes -O quotes.jl. This approach is fairly crude, but the command-line -O option overrides any other output settings. A .jl (JSON lines) file has the advantage over .json that it is easy to append data to later.

The approach above processes the scraped data inside the spider file. For anything more complex, the usual approach is to edit pipelines.py, i.e. set up an item pipeline; that is covered later.

A more important Scrapy feature is crawling an entire site, i.e. following and parsing the URLs extracted from pages. The example above can be rewritten as:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

response.urljoin() helps when the URL taken from an a tag's href attribute is incomplete; it detects this and completes the URL. More importantly, yield is used here to build another Request object, so parse() now contains two yields, which is one of the differences between yield and return. In scrapy.Request(next_page, callback=self.parse), next_page is the next URL (Scrapy puts it into the request queue and sends the request once the downloader is free), and callback points back to parse() itself, i.e. parse() calls itself, which is recursion. Sometimes the callback is another parse_xxx() method you wrote, usually because the two pages have different layouts.

If the pages being parsed are similar, you can also use response.follow(), in which case response.urljoin() can be omitted:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

response.follow() calls urljoin automatically. Also, the first argument of response.follow(next_page, callback=self.parse) can be a Selector object, as long as the tag has an href attribute; or you can use follow_all(), which takes a list of Selector objects or even a CSS expression, with a slightly different format:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        yield from response.follow_all(css='ul.pager a', callback=self.parse)

As mentioned before, the callback can be another parse_xxx() method you wrote yourself:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

With all of the methods above you don't need to worry about visiting the same URL twice, because Scrapy deduplicates URLs out of the box. To turn this off, pass dont_filter=True to the follow-style methods, or configure a DUPEFILTER_CLASS, though that is not recommended.

parse-style methods can also pass item data along with a request (request passing); this is covered in more detail later. As a quick example of when it is useful: you want to crawl a blog whose index page carries each post's title and upload time, but you also need to enter each post to scrape its full content. That is exactly when passing data between requests helps; a sketch follows.
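
Below is a minimal sketch of that scenario (the site, selectors and field names are made up for illustration): the index callback fills part of the item and hands it to the detail-page callback through cb_kwargs.

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blog_detail'
    start_urls = ['https://example.com/archive/']   # placeholder index page

    def parse(self, response):
        for post in response.css('div.post'):
            item = {
                'title': post.css('a.title::text').get(),
                'uploaded': post.css('span.date::text').get(),
            }
            detail_url = post.css('a.title::attr(href)').get()
            # hand the partially-filled item to the detail page's callback
            yield response.follow(detail_url, callback=self.parse_detail,
                                  cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        # the dict passed via cb_kwargs arrives as a keyword argument
        item['content'] = response.css('div.content ::text').getall()
        yield item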

3. The Scrapy command-line tool

There are a few things to know about using Scrapy on the command line. First, the project directory contains a scrapy.cfg file, the project's configuration file, which has the highest priority (over Scrapy's system-wide and per-user configuration). In this file you can also make several projects share one settings.py.

Typing scrapy in cmd shows, when you are inside a Scrapy project directory, whether there is an active project on the first line, followed by usage help. The commands are detailed below; [] marks optional parts.
(1) Detailed help for a command: scrapy <command> -h
(2) Create a project: scrapy startproject <project_name> [project_dir]. After creating it, cd into the project directory.
(3) Create a spider: scrapy genspider [-t template] <spider_name> <domain>. -t selects a spider template; the domain can be anything for now.
(4) List spider templates: scrapy genspider -l
(5) Start a spider: scrapy crawl <spider>
(6) Run contract checks on a spider: scrapy check [-l] <spider>
(7) List all spiders in the project: scrapy list
(8) Edit a spider: scrapy edit <spider>, which opens the spider in the default editor; the editor can be chosen with the EDITOR environment variable.
(9) Fetch a page: scrapy fetch <url> downloads the page and writes it to standard output. Inside a project it fetches the page using the project's settings.py or the spider's own settings, e.g. a UA set in settings.py or per-spider customizations. Options: --spider=SPIDER (use that spider's settings), --headers (print the HTTP headers instead of the body), --no-redirect (do not follow redirects).
(10) View a page: scrapy view <url>, with options --spider=SPIDER and --no-redirect. Together with fetch, these two commands let you check whether the spider sees a page differently from what you see in the browser.
(11) Inspect settings: scrapy settings [options], e.g. scrapy settings --get BOT_NAME. Inside a project it shows the project's settings, otherwise the defaults.
(12) Run a standalone spider file: scrapy runspider <spider_file.py> runs a self-contained spider file; rarely used.
(13) Show the Scrapy version: scrapy version [-v]
The remaining commands see little use and are not covered here. You can also define custom Scrapy commands via the COMMANDS_MODULE setting; see the official docs.

4. Spider classes

The spider classes in spider files got the most attention in the usage example above. Every Scrapy spider class must inherit from the scrapy.Spider base class, whose main members, start_requests(), start_urls and parse(), were covered earlier. Its other attributes and methods are listed here (a small sketch follows the list):
(1) name, the spider's name, which distinguishes the spiders within a project. A spider class is normally instantiated only once, but multiple instances are possible. A common convention is to derive the name from the domain being crawled.
(2) allowed_domains, the domains (including subdomains) the spider is allowed to crawl; often simply commented out.
(3) start_urls, covered earlier.
(4) custom_settings, per-spider settings that override those in settings.py. It must be a class attribute rather than something set in __init__, because the settings are applied before the spider is instantiated.
(5) logger, a logger created with the spider's name, for writing your own log messages; alternatively use the log(message[, level, component]) method.
(6) closed(reason), called once when the spider finishes.
(7) start_requests() and parse(response), covered earlier.
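
A minimal sketch pulling these attributes together (the domain, settings values and selectors are placeholders, not recommendations):

import scrapy


class ExampleSpider(scrapy.Spider):
    # hypothetical spider illustrating the attributes listed above
    name = 'example'
    allowed_domains = ['example.com']          # optional
    start_urls = ['https://example.com/']
    custom_settings = {                        # overrides settings.py for this spider only
        'DOWNLOAD_DELAY': 1,
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        self.logger.info('Parsing %s', response.url)   # per-spider logger
        yield {'title': response.css('title::text').get()}

    def closed(self, reason):
        # called once when the spider finishes
        self.logger.info('Spider closed: %s', reason)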

Scrapy also provides several derived spider classes that make it easier to crawl a whole site, crawl from a sitemap, or parse XML feeds. In the examples below, assume items.py looks like this:

import scrapy

class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()

(1)CrawlSpider

That is scrapy.spiders.CrawlSpider, commonly used to crawl a whole site; its distinguishing feature is its extraction rules (rules). A CrawlSpider example:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlspiSpider(CrawlSpider):
    name = 'crawlspi'
    #allowed_domains = ['xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        Rule(LinkExtractor(allow=r'page=\d+'),
             callback='parse_item',
             follow=True),
        Rule(LinkExtractor(allow=r'xxxx'),
             callback='parse_detail',
             follow=True),
        Rule(LinkExtractor(restrict_xpaths='//a', allow=r'/wzshow_\d+\.html'),
             callback='parse_item'),  # links can also be restricted to specific nodes via restrict_xpaths
    )

    # LinkExtractor(allow=r'Items/') extracts links that match the regular expression
    # Each Rule automatically sends requests for the extracted links and parses them via callback
    #   (data cannot be passed between requests this way)
    # follow=True applies the rules again to the pages reached through extracted links (whole-site crawling)
    # Typically the first Rule extracts the other page-number links, the second the detail urls on the current page
    # To pass data across pages instead, define a DetailItem class in items.py sharing a PRIMARY KEY for matching,
    # branch on item.__class__.__name__ in the pipeline to handle the different item classes,
    # and join the rows on that primary key when inserting into MySQL

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item

    def parse_detail(self, response):
        item = {}
        return item

If several Rule objects are defined and a link matches more than one rule, the first matching rule is applied. The full constructor of a Rule is:

scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None, errback=None)

link_extractor is a Link Extractor object; if omitted, every extracted link matches. It is used to generate Request objects, and the resulting request's meta['link_text'] carries the text of the link. callback is a callable or a string naming the method to call, usually a parse-style method. cb_kwargs is a dict of keyword arguments passed to the callback (so it can also be used to pass data between requests; the comments in the example above apply to Scrapy versions before 1.7). follow decides whether the rules are applied again to the pages extracted by this rule. process_links is also a callable or a method name; it is mainly used to filter the extracted links. process_request is likewise a callable or a method name (its first argument is the request, its second the response that produced that request; it returns a Request or None) and is mainly used to filter requests (returning None drops the request). errback is the callback invoked when an exception occurs.

The example given in the official documentation:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        item['link_text'] = response.meta['link_text']
        url = response.xpath('//td[@id="additional_data"]/@href').get()
        return response.follow(url, self.parse_additional_page, cb_kwargs=dict(item=item))

    def parse_additional_page(self, response, item):
        item['additional_data'] = response.xpath('//p[@id="additional_data"]/text()').get()
        return item

Note that return rather than yield is used throughout.

(2) XMLFeedSpider and CSVFeedSpider

These parse XML feed pages and CSV data respectively; they are not covered in detail here. They also use return rather than yield. A minimal sketch follows.
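
For reference, a minimal XMLFeedSpider sketch (the feed URL and tag names are hypothetical); parse_node() is called once per matching node:

from scrapy.spiders import XMLFeedSpider


class FeedSpider(XMLFeedSpider):
    name = 'feed'
    start_urls = ['https://example.com/feed.xml']   # placeholder feed
    iterator = 'iternodes'   # the default, streaming iterator
    itertag = 'item'         # the node name to iterate over

    def parse_node(self, response, node):
        # node is a Selector positioned on one <item> element
        return {
            'title': node.xpath('title/text()').get(),
            'link': node.xpath('link/text()').get(),
        }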

(3)SitemapSpider

Crawls pages based on a sitemap; in practice it is also rarely used. You can point it at sitemap pages manually, or let it discover the sitemaps from robots.txt. A minimal sketch follows.
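
A minimal SitemapSpider sketch (URLs and callback names are placeholders): sitemap_urls can point at sitemap files or at robots.txt, and sitemap_rules maps URL patterns to callbacks.

from scrapy.spiders import SitemapSpider


class ShopSitemapSpider(SitemapSpider):
    name = 'shop_sitemap'
    # either point at sitemap files directly...
    sitemap_urls = ['https://example.com/sitemap.xml']
    # ...or at robots.txt, and the sitemaps listed there are used
    # sitemap_urls = ['https://example.com/robots.txt']
    sitemap_rules = [
        ('/product/', 'parse_product'),   # urls containing /product/ go to parse_product
    ]

    def parse_product(self, response):
        yield {'url': response.url,
               'name': response.css('h1::text').get()}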

5. Selector objects

In the scraping world, the common ways to parse HTML or XML are BeautifulSoup (slow), regular expressions (fiddly and error-prone) and lxml, which can be queried with CSS selectors or XPath expressions. Scrapy's Selector objects can parse HTML too; under the hood they use lxml with XPath expressions, and the speed is comparable. As mentioned earlier, response.css() and response.xpath() return lists of selector objects, on which you can call get() and getall(); these return None when nothing matches, and get(default="some-string") lets you choose what is returned in that case. Selectors also offer other methods for pulling out attributes and so on, such as attrib and re(), which are not detailed here.
Also, when feeding text into an XPath string function, prefer "." over ".//text()"; XPath expressions can take parameters as well, e.g. response.xpath('//div[count(a)=$cnt]/@id', cnt=5).
Scrapy additionally ships its own has-class XPath function: response.xpath('//p[has-class("foo")]') matches tags such as <p class="foo bar-baz">First</p> and <p class="foo">Second</p> whatever their other classes are (with several arguments, e.g. has-class("foo", "bar-baz"), all the listed classes must be present), though it is a bit slower than the CSS approach.
parsel.xpathfuncs.set_xpathfunc(fname, func) lets you register your own XPath functions. A short example of these selector features follows.
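
A few of these selector features in one place, as a quick sketch (the HTML snippet is made up):

from scrapy.selector import Selector

sel = Selector(text='<div id="d1"><p class="foo bar-baz">First</p><p class="foo">Second</p><a href="/x">x</a></div>')

sel.css('p.foo::text').getall()                     # ['First', 'Second']
sel.css('span::text').get(default='missing')        # default value when nothing matches
sel.css('p').attrib['class']                        # 'foo bar-baz' (attributes of the first match)
sel.css('p::text').re(r'(F\w+)')                    # ['First'], regex applied to the extracted text
sel.xpath('//p[has-class("foo")]').getall()         # both <p> tags, via the has-class() XPath function
sel.xpath('//div[count(a)=$cnt]/@id', cnt=1).get()  # parameterised XPath -> 'd1'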

The Selector constructor is scrapy.selector.Selector(*args, **kwargs), with these parameters: response, an HtmlResponse or XmlResponse object; text, the response body as a string, which cannot be combined with response; and type, the format, one of "html", "xml" or None, where None picks the format automatically.

For convenience, Scrapy also ships a selector-list class derived from the built-in list, so the methods above work on lists of selectors as well.

6. Items

Item-style classes are dicts, or can be used as dicts; they carry the data onward once the spider has extracted it. Item classes are defined in items.py. The common kinds are:
(1) dict, a plain Python dict.
(2) Item objects, which offer a dict-like API plus a few extras. The constructor is scrapy.item.Item([arg]) or scrapy.Item([arg]). In short, you declare a field in items.py before you can use that field in the spider, otherwise an exception is raised. Item.copy() and Item.deepcopy() make shallow and deep copies of an item. An items.py using Item objects looks like this:

import scrapy


class IpoolItem(scrapy.Item):
    one_field  = scrapy.Field()
    another_field = scrapy.Field()

(3) dataclass objects, which support a default value for each field, for example:

from dataclasses import dataclass

@dataclass
class CustomItem:
    one_field: str
    another_field: int

Although each field declares a type, the type is not checked when the spider assigns to the field.
(4) attr.s objects, for example:

import attr

@attr.s
class CustomItem:
    one_field = attr.ib()
    another_field = attr.ib()

Default values are supported here as well.

Item-style objects support the dict methods keys() and items(). Also, because of how Python copies objects, if an item holds lists or dicts and you need an independent copy, use a deep copy.

Finally, item fields can also carry metadata, and you can derive a subclass from your own item class to add fields or change field metadata:

class IpoolItem(scrapy.Item):
    one_field  = scrapy.Field()
    another_field = scrapy.Field()

class ExtendedIpoolItem(IpoolItem):
    third_field = scrapy.Field()

7. Item Loaders

Items provide the container for the data; an Item Loader makes it easier to fill that container. For instance, you can assign to the same field several times through an item loader, and it will combine the values (strings are appended). Item loaders are used inside spider files, for example:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()

This uses the add_xpath(), add_css(), add_value() and load_item() methods; note how they behave. Until load_item() is called, all the data sits inside the item loader; only then are the values written into the item's fields. In fact, every value assigned to a field goes through an input processor on the way into the loader and through an output processor when load_item() is called, which enables some further tricks; a small sketch of declaring processors follows.
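
As a sketch of those processors (the loader class and field names are hypothetical; the processors come from the itemloaders package that Scrapy uses):

from itemloaders.processors import Join, MapCompose, TakeFirst
from scrapy.loader import ItemLoader


class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()   # keep only the first value by default
    name_in = MapCompose(str.strip)          # input processor: strip each extracted string
    tags_out = Join(', ')                    # output processor: join all values into one string

Used as ProductLoader(item=Product(), response=response), values added to name are stripped on the way in, the tags field is joined into a single string, and fields without an explicit output processor keep only their first value when load_item() is called.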

The ItemLoader constructor is scrapy.loader.ItemLoader(item=None, selector=None, response=None, parent=None, **context), where selector and response are mutually exclusive and selector takes precedence. If you want to replace a value instead of appending, ItemLoader also has replace-style methods: just swap add for replace in the add-style method names.

Finally, an ItemLoader can be anchored at a particular node:

loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()

The XPath expressions used with add_xpath here are relative to the footer tag, i.e. the full expression is //footer/a[@class = "social"]/@href.

All in all, item loaders exist to simplify producing items; they help in some situations but are often not needed.

8. scrapy shell

Its main use is testing XPath and CSS expressions. Since the browser's developer tools do the same thing more conveniently, it is not covered here.

9. Item Pipeline

The item pipeline: once the spider produces items, their data is passed into the pipelines for processing, i.e. the project's pipelines.py. You might wonder why not process the data right in the spider, as soon as it is scraped. Partly for cleaner code, partly because pipelines have some advantages of their own. Typical pipeline jobs are: cleaning up HTML data, validating data, deduplicating data, and storing data in a database.

In pipelines.py you can write your own pipelines. The only required method is process_item(self, item, spider), which processes the item and must return an item object or a Twisted Deferred, or raise a DropItem exception (in which case the item is not passed to the remaining pipelines).
You may also implement the following methods:
open_spider(self, spider) is called once when the spider opens; you can create a pymysql connection and cursor here, or open a file object. Its counterpart close_spider(self, spider) is called once when the spider closes and can close the connection, cursor or file object. There is also the from_crawler(cls, crawler) classmethod: when present, it is used to instantiate the pipeline from a crawler object, so it must return a pipeline instance. Its purpose is to give the pipeline access to core objects such as the settings, and to provide hooks.

An example pipeline file:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
class PricePipeline:

    vat_factor = 1.15

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            if adapter.get('price_excludes_vat'):
                adapter['price'] = adapter['price'] * self.vat_factor
            return item
        else:
            raise DropItem(f"Missing price in {item}")

Another one (storing into MongoDB):

import pymongo
from itemadapter import ItemAdapter

class MongoPipeline:

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item

Finally, if you write additional pipelines in pipelines.py or rename the default one, you must enable them in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.PipelineName1': 300,
    'myproject.pipelines.PipelineName2': 800,
}

The trailing numbers 300 and 800 can range from 0 to 1000; items flow from the lower-numbered pipelines to the higher-numbered ones.

10. Feed exports

Feed exports let you export the scraped data as json, json lines (jl), csv, xml, or Python pickle, so it can be consumed by different systems. The feature is configured mainly in settings.py; the relevant settings are:

FEEDS
FEED_EXPORT_ENCODING
FEED_STORE_EMPTY
FEED_EXPORT_FIELDS
FEED_EXPORT_INDENT
FEED_STORAGES
FEED_STORAGE_FTP_ACTIVE
FEED_STORAGE_S3_ACL
FEED_EXPORTERS
FEED_EXPORT_BATCH_ITEM_COUNT

The first one is mandatory. Example:

FEEDS={
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
        'store_empty': False,
        'item_classes': [MyItemClass1, 'myproject.items.MyItemClass2'],
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
           'export_empty_fields': True,
        },
    },
    '/home/user/documents/items.xml': {
        'format': 'xml',
        'fields': ['name', 'price'],
        'item_filter': MyCustomFilter1,
        'encoding': 'latin1',
        'indent': 8,
    },
    pathlib.Path('items.csv.gz'): {
        'format': 'csv',
        'fields': ['price', 'name'],
        'item_filter': 'myproject.filters.MyCustomFilter2',
        'postprocessing': [MyPlugin1, 'scrapy.extensions.postprocessing.GzipPlugin'],
        'gzip_compresslevel': 5,
    },
}

The accepted keys, as explained in the official docs:

The following is a list of the accepted keys and the setting that is used as a fallback value if that key is not provided for a specific feed definition:

format: the serialization format.

This setting is mandatory, there is no fallback value.

batch_item_count: falls back to FEED_EXPORT_BATCH_ITEM_COUNT.

New in version 2.3.0.

encoding: falls back to FEED_EXPORT_ENCODING.

fields: falls back to FEED_EXPORT_FIELDS.

item_classes: list of item classes to export.

If undefined or empty, all items are exported.

New in version 2.6.0.

item_filter: a filter class to filter items to export.

ItemFilter is used by default.

New in version 2.6.0.

indent: falls back to FEED_EXPORT_INDENT.

item_export_kwargs: dict with keyword arguments for the corresponding item exporter class.

New in version 2.4.0.

overwrite: whether to overwrite the file if it already exists (True) or append to its content (False).

The default value depends on the storage backend:

Local filesystem: False

FTP: True

Note

Some FTP servers may not support appending to files (the APPE FTP command).

S3: True (appending is not supported)

Standard output: False (overwriting is not supported)

New in version 2.4.0.

store_empty: falls back to FEED_STORE_EMPTY.

uri_params: falls back to FEED_URI_PARAMS.

postprocessing: list of plugins to use for post-processing.

The plugins will be used in the order of the list passed.

New in version 2.6.0.

The other settings, quoted from the official docs:

FEED_EXPORT_ENCODING
Default: None

The encoding to be used for the feed.

If unset or set to None (default) it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.

Use utf-8 if you want UTF-8 for JSON too.

FEED_EXPORT_FIELDS
Default: None

A list of fields to export, optional. Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"].

Use FEED_EXPORT_FIELDS option to define fields to export and their order.

When FEED_EXPORT_FIELDS is empty or None (default), Scrapy uses the fields defined in item objects yielded by your spider.

If an exporter requires a fixed set of fields (this is the case for CSV export format) and FEED_EXPORT_FIELDS is empty or None, then Scrapy tries to infer field names from the exported data - currently it uses field names from the first item.

FEED_EXPORT_INDENT
Default: 0

Amount of spaces used to indent the output on each level. If FEED_EXPORT_INDENT is a non-negative integer, then array elements and object members will be pretty-printed with that indent level. An indent level of 0 (the default), or negative, will put each item on a new line. None selects the most compact representation.

Currently implemented only by JsonItemExporter and XmlItemExporter, i.e. when you are exporting to .json or .xml.

FEED_STORE_EMPTY
Default: False

Whether to export empty feeds (i.e. feeds with no items).

FEED_STORAGES
Default: {}

A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the values are paths to storage classes.

FEED_STORAGE_FTP_ACTIVE
Default: False

Whether to use the active connection mode when exporting feeds to an FTP server (True) or use the passive connection mode instead (False, default).

For information about FTP connection modes, see What is the difference between active and passive FTP?.

FEED_STORAGE_S3_ACL
Default: '' (empty string)

A string containing a custom ACL for feeds exported to Amazon S3 by your project.

For a complete list of available values, access the Canned ACL section on Amazon S3 docs.

FEED_STORAGES_BASE
Default:

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None to their URI scheme in FEED_STORAGES. E.g., to disable the built-in FTP storage backend (without replacement), place this in your settings.py:

FEED_STORAGES = {
    'ftp': None,
}
FEED_EXPORTERS
Default: {}

A dict containing additional exporters supported by your project. The keys are serialization formats and the values are paths to Item exporter classes.

FEED_EXPORTERS_BASE
Default:

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py:

FEED_EXPORTERS = {
    'csv': None,
}
FEED_EXPORT_BATCH_ITEM_COUNT
New in version 2.3.0.

Default: 0

If assigned an integer number higher than 0, Scrapy generates multiple output files storing up to the specified number of items in each output file.

When generating multiple output files, you must use at least one of the following placeholders in the feed URI to indicate how the different output file names are generated:

%(batch_time)s - gets replaced by a timestamp when the feed is being created (e.g. 2020-03-28T14-45-08.237134)

%(batch_id)d - gets replaced by the 1-based sequence number of the batch.

Use printf-style string formatting to alter the number format. For example, to make the batch ID a 5-digit number by introducing leading zeroes as needed, use %(batch_id)05d (e.g. 3 becomes 00003, 123 becomes 00123).

For instance, if your settings include:

FEED_EXPORT_BATCH_ITEM_COUNT = 100
And your crawl command line is:

scrapy crawl spidername -o "dirname/%(batch_id)d-filename%(batch_time)s.json"
The command line above can generate a directory tree like:

->projectname
-->dirname
--->1-filename2020-03-28T14-45-08.237134.json
--->2-filename2020-03-28T14-45-09.148903.json
--->3-filename2020-03-28T14-45-10.046092.json
Where the first and second files contain exactly 100 items. The last one contains 100 items or fewer.

FEED_URI_PARAMS
Default: None

A string with the import path of a function to set the parameters to apply with printf-style string formatting to the feed URI.

The function signature should be as follows:

scrapy.extensions.feedexport.uri_params(params, spider)
Return a dict of key-value pairs to apply to the feed URI using printf-style string formatting.

Parameters
params (dict) –

default key-value pairs

Specifically:

batch_id: ID of the file batch. See FEED_EXPORT_BATCH_ITEM_COUNT.

If FEED_EXPORT_BATCH_ITEM_COUNT is 0, batch_id is always 1.

New in version 2.3.0.

batch_time: UTC date and time, in ISO format with : replaced with -.

See FEED_EXPORT_BATCH_ITEM_COUNT.

New in version 2.3.0.

time: batch_time, with microseconds set to 0.

spider (scrapy.Spider) – source spider of the feed items

Caution

The function should return a new dictionary, modifying the received params in-place is deprecated.

For example, to include the name of the source spider in the feed URI:

Define the following function somewhere in your project:

# myproject/utils.py
def uri_params(params, spider):
    return {**params, 'spider_name': spider.name}
Point FEED_URI_PARAMS to that function in your settings:

# myproject/settings.py
FEED_URI_PARAMS = 'myproject.utils.uri_params'
Use %(spider_name)s in your feed URI:

scrapy crawl <spider_name> -o "%(spider_name)s.jl"

Since I rarely use this feature, the excerpt above is left untranslated.

11. Requests and responses

Request objects are created by spiders and eventually handed to the downloader; Response objects are created by the downloader from the request and returned to the spider that issued it. Both have derived classes; the base classes are introduced first.

class scrapy.http.Request(*args, **kwargs). Its parameters are:
(1) url, a URL string.
(2) callback, the callback that receives the response once the request comes back; it is therefore usually one of the spider's parse-style methods, with parse() as the default. When processing fails, the callback given by errback is called instead.
(3) method, a string, the HTTP method of the request, default GET.
(4) meta, a dict assigned (as a shallow copy) to request.meta; this is the request passing mentioned earlier (although cb_kwargs is now the preferred way to pass data to callbacks). It is also carried over to the corresponding response.
(5) body, bytes or str, the request body; it is stored internally as bytes.
(6) headers, a dict, the request headers.
(7) cookies, a dict (or list of dicts) of cookies to send with this request. Cookies sent back by the site are merged into subsequent requests, which makes the crawler behave more like a real browser; to turn that off, set dont_merge_cookies to True in request.meta. Cookies can also be manipulated via request.cookies in a downloader middleware or through the CookiesMiddleware.
(8) encoding, a string, the request's encoding, default utf-8. It is used to percent-encode the URL (if you have ever looked closely at URLs, spaces become %20 and Chinese characters are percent-encoded too) and to convert the request body to bytes.
(9) priority, the request's priority, default 0. The scheduler uses it to decide the order in which requests are sent; higher values are sent earlier.
(10) dont_filter: the scheduler deduplicates URLs; with this set to True the request is not filtered out. Useful for URLs that return different content on every visit, but use it carefully to avoid loops.
(11) errback, the function called when an exception occurs while handling the request, including HTTP errors such as 404.
(12) flags, a list, used for logging.
(13) cb_kwargs, a dict of keyword arguments for the callback.

The members of a Request object:
(1) url, str, the request URL; note it is percent-encoded. It is read-only; to change it, use the request's replace() method, see below.
(2) method, str, the HTTP method, e.g. "GET", "POST", "PUT".
(3) headers, a dict-like object holding the request headers.
(4) body, the request body as bytes. Also read-only; change it via replace().
(5) meta, a dict for metadata that other extensions and middlewares can read; a request's meta is passed on to the corresponding response.
(6) cb_kwargs, the dict of arguments passed to the callback.
(7) attributes: Tuple[str, ...] = ('url', 'callback', 'method', 'headers', 'body', 'cookies', 'meta', 'encoding', 'priority', 'dont_filter', 'errback', 'flags', 'cb_kwargs'), a tuple of strings naming the request's basic attributes, used by Request.replace(), Request.to_dict() and request_from_dict().
(8) copy(), returns a shallow copy of the request object.
(9) replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs]), returns a new request in which the listed members are replaced by new values and the unlisted ones keep their old values; again a shallow copy.
(10) to_dict(*, spider: Optional[Spider] = None) → dict, similar to attributes but returns a dict. Together with scrapy.utils.request.request_from_dict(d: dict, *, spider: Optional[Spider] = None) → Request, that dict can be turned back into a request; the optional spider argument resolves the callback against that spider's methods, i.e. the response of the rebuilt request is handled by that spider's parse method. A small round-trip sketch follows.
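
A tiny round-trip sketch (assuming request is an existing Request and self is the spider it belongs to):

from scrapy.utils.request import request_from_dict

d = request.to_dict(spider=self)              # plain dict with the attributes listed above
restored = request_from_dict(d, spider=self)  # rebuild the Request; the callback is looked up on the spider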

Now the format of an errback callback. Its first argument must be a failure; for example:

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "https://example.invalid/",             # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

And an example that uses failure.request.cb_kwargs:

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             errback=self.errback_page2,
                             cb_kwargs=dict(main_url=response.url))
    yield request

def parse_page2(self, response, main_url):
    pass

def errback_page2(self, failure):
    yield dict(
        main_url=failure.request.cb_kwargs['main_url'],
    )

Now a few words about the meta dict. In Scrapy 2.6 it is generally no longer used to pass data to callbacks, but to pass information between components. Its built-in keys are:

bindaddress
cookiejar
dont_cache
dont_merge_cookies
dont_obey_robotstxt
dont_redirect
dont_retry
download_fail_on_dataloss
download_latency
download_maxsize
download_timeout
ftp_password (See FTP_PASSWORD for more info)
ftp_user (See FTP_USER for more info)
handle_httpstatus_all
handle_httpstatus_list
max_retry_times
proxy
redirect_reasons
redirect_urls
referrer_policy

Probably the most noteworthy is proxy, which lets a downloader middleware set a proxy on a request, as in the sketch below.
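
A minimal sketch of such a downloader middleware (the class name and proxy address are placeholders; it still has to be enabled under DOWNLOADER_MIDDLEWARES in settings.py):

class ProxyDownloaderMiddleware:
    def process_request(self, request, spider):
        # HttpProxyMiddleware reads request.meta['proxy'] when the request is sent
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        return None   # continue processing this request normally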

Scrapy also provides some Request subclasses, including:
(1) FormRequest, mainly for HTML forms, e.g. simulating a login with a simple username/password form (a sketch follows this list), although this is not hard with a plain Request either.
(2) JsonRequest, for requests that carry or return JSON data; rarely needed.
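
For item (1), a minimal login sketch using FormRequest.from_response (the URL and form field names are placeholders):

import scrapy
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # from_response() copies the form's hidden fields and fills in the rest
        yield FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b'authentication failed' in response.body:
            self.logger.error('Login failed')
            return
        # continue crawling the logged-in part of the site here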

Finally, the Response object. Its constructor is class scrapy.http.Response(*args, **kwargs), with these parameters:
(1) url, a string, the URL the response came from (the docs do not say whether it is percent-encoded, and it appears not to be; they do note that Response.request.url doesn't always equal Response.url).
(2) status, int, the HTTP status code; normally 200, while codes such as 404 indicate errors.
(3) headers, a dict, the response headers.
(4) body, bytes, the response body. To get the page source as text, use response.text (note that only TextResponse and its subclasses have the text attribute).
(5) flags, a list holding the values of Response.flags; it is shallow-copied into the response.
(6) request, an instance of scrapy.Request or a subclass: the request that produced this response.
(7) certificate, a twisted.internet.ssl.Certificate, the SSL certificate.
(8) ip_address, an ipaddress.IPv4Address or ipaddress.IPv6Address, the IP address of the server the response came from.
(9) protocol, str, the protocol the response was downloaded with, e.g. "HTTP/1.0", "HTTP/1.1" or "h2".

A Response object has these members:
(1) url, read-only; change it with replace().
(2) status
(3) headers; read a header with response.headers.getlist('name') or response.headers.get('name').
(4) body, read-only; change it with replace().
(5) request: when an HTTP redirect happens, the original request stays associated with the redirected response. This attribute should only be used in spider code and spider middlewares, not in downloader middlewares.
(6) meta, the metadata handed over from the request; the association survives redirects and retries.
(7) cb_kwargs, new in Scrapy 2.0, the dict of arguments taken from the request and passed to the callback; also unaffected by redirects and retries.
(8) flags, the response's status flags, e.g. "cached" or "redirected", used for logging.
(9) certificate
(10) ip_address
(11) protocol
(12) attributes: Tuple[str, ...] = ('url', 'status', 'headers', 'body', 'flags', 'request', 'certificate', 'ip_address', 'protocol'), a tuple of strings naming the response's attributes, used by replace().
(13) copy(), returns a shallow copy of the response.
(14) replace([url, status, headers, body, request, flags, cls]), replaces the listed values, keeps the rest, and returns a new response; again a shallow copy.
(15) urljoin(url), resolves a relative URL against the response's URL into an absolute one; it simply calls urllib.parse.urljoin(response.url, url).
(16) follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None) → Request, returns a request; the parameters match the Request constructor, except that url may be a relative URL or a scrapy.link.Link object.
(17) follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None) → Generator[Request, None, None], like follow() but for several URLs.

Some Response subclasses:
(1) scrapy.http.TextResponse adds encoding handling and is meant for text content such as HTML, XML or JSON (binary media such as images or audio are served by the plain Response). The main difference is the extra encoding constructor parameter, a str naming the response's encoding; when None, the encoding is detected automatically. It also adds attributes and methods such as text, encoding, selector, xpath(), css() and json() (which parses a JSON body into Python objects).
(2) scrapy.http.HtmlResponse and scrapy.http.XmlResponse, both subclasses of TextResponse.

12. Link extractors

A link extractor object extracts links from a response and is handy for whole-site crawling; it came up earlier when CrawlSpider was introduced.
Besides CrawlSpider, it can be used in any spider, for example by calling LxmlLinkExtractor.extract_links to get the list of Link objects in a response that match your conditions.

Its constructor is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True), with these parameters:
(1) allow, str or list: a regular expression or list of regular expressions that extracted links must match. If omitted or empty, no regex restriction is applied.
(2) deny, str or list: regexes that exclude links; it takes precedence over allow (exclusion happens first). May also be omitted or empty.
(3) allow_domains, str or list of str: only links pointing to these domains are extracted.
(4) deny_domains, the opposite of allow_domains.
(5) deny_extensions, a list of str: file extensions to ignore when extracting links; defaults to scrapy.linkextractors.IGNORED_EXTENSIONS. Since Scrapy 2.0, IGNORED_EXTENSIONS includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm and xz.
(6) restrict_xpaths, str or list: only extract links inside the nodes selected by these XPath expressions.
(7) restrict_css, str or list: the same, using CSS.
(8) restrict_text, str or list: a regex or list of regexes that the link's text must match; when a list is given, matching any one of them is enough for the link to be extracted.
(9) tags, str or list: only links found in these tags are considered; the usual tags are a and area.
(10) attrs, list: the attributes links are taken from, default 'href'.
(11) canonicalize, bool: whether to normalize extracted links with w3lib.url.canonicalize_url; default False, and changing it is not recommended.
(12) unique, bool: whether to deduplicate already at the extraction stage.
(13) process_value, a collections.abc.Callable applied to each extracted value, default lambda x: x. For example, to extract the link from <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>, use:

def process_value(value):
    m = re.search("javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)

(14) strip, bool: whether to run strip() on the extracted link strings. Leave it as-is, since some standards forbid leading or trailing whitespace around URLs.

LxmlLinkExtractor also provides the extract_links(response) method, which applies the extraction rules to a response and returns the list of matching Link objects, as in the sketch below.
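
A minimal sketch of using a link extractor outside CrawlSpider (the spider name and regex are illustrative):

import scrapy
from scrapy.linkextractors import LinkExtractor


class LinksSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        extractor = LinkExtractor(allow=r'/page/\d+/')    # only pagination links
        for link in extractor.extract_links(response):    # returns scrapy.link.Link objects
            self.logger.info('Found %s (%s)', link.url, link.text)
            yield response.follow(link.url, callback=self.parse)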

The Link object, constructed as scrapy.link.Link(url, text='', fragment='', nofollow=False), represents a link extracted by a link extractor. The example given in the docs for its parameters:

<a href="https://example.com/nofollow.html#foo" rel="nofollow">Dont follow this one</a>
Parameters
url – the absolute url being linked to in the anchor tag. From the sample, this is https://example.com/nofollow.html.

text – the text in the anchor tag. From the sample, this is Dont follow this one.

fragment – the part of the url after the hash symbol. From the sample, this is foo.

nofollow – an indication of the presence or absence of a nofollow value in the rel attribute of the anchor tag.

13. Settings

For this part, see the earlier article Python,爬虫与深度学习(11)——scrapy中settings的可选项 (on the available settings options) or the official docs.
Below are some options not covered there:

AJAXCRAWL_ENABLED

AUTOTHROTTLE_DEBUG

AUTOTHROTTLE_ENABLED

AUTOTHROTTLE_MAX_DELAY

AUTOTHROTTLE_START_DELAY

AUTOTHROTTLE_TARGET_CONCURRENCY

CLOSESPIDER_ERRORCOUNT

CLOSESPIDER_ITEMCOUNT

CLOSESPIDER_PAGECOUNT

CLOSESPIDER_TIMEOUT

COMMANDS_MODULE

COMPRESSION_ENABLED

COOKIES_DEBUG

COOKIES_ENABLED

FEEDS

FEED_EXPORTERS

FEED_EXPORTERS_BASE

FEED_EXPORT_BATCH_ITEM_COUNT

FEED_EXPORT_ENCODING

FEED_EXPORT_FIELDS

FEED_EXPORT_INDENT

FEED_STORAGES

FEED_STORAGES_BASE

FEED_STORAGE_FTP_ACTIVE

FEED_STORAGE_S3_ACL

FEED_STORE_EMPTY

FEED_URI_PARAMS

FILES_EXPIRES

FILES_RESULT_FIELD

FILES_STORE

FILES_STORE_GCS_ACL

FILES_STORE_S3_ACL

FILES_URLS_FIELD

HTTPCACHE_ALWAYS_STORE

HTTPCACHE_DBM_MODULE

HTTPCACHE_DIR

HTTPCACHE_ENABLED

HTTPCACHE_EXPIRATION_SECS

HTTPCACHE_GZIP

HTTPCACHE_IGNORE_HTTP_CODES

HTTPCACHE_IGNORE_MISSING

HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS

HTTPCACHE_IGNORE_SCHEMES

HTTPCACHE_POLICY

HTTPCACHE_STORAGE

HTTPERROR_ALLOWED_CODES

HTTPERROR_ALLOW_ALL

HTTPPROXY_AUTH_ENCODING

HTTPPROXY_ENABLED

IMAGES_EXPIRES

IMAGES_MIN_HEIGHT

IMAGES_MIN_WIDTH

IMAGES_RESULT_FIELD

IMAGES_STORE

IMAGES_STORE_GCS_ACL

IMAGES_STORE_S3_ACL

IMAGES_THUMBS

IMAGES_URLS_FIELD

MAIL_FROM

MAIL_HOST

MAIL_PASS

MAIL_PORT

MAIL_SSL

MAIL_TLS

MAIL_USER

MEDIA_ALLOW_REDIRECTS

METAREFRESH_ENABLED

METAREFRESH_IGNORE_TAGS

METAREFRESH_MAXDELAY

REDIRECT_ENABLED

REDIRECT_MAX_TIMES

REFERER_ENABLED

REFERRER_POLICY

RETRY_ENABLED

RETRY_HTTP_CODES

RETRY_PRIORITY_ADJUST

RETRY_TIMES

TELNETCONSOLE_HOST

TELNETCONSOLE_PASSWORD

TELNETCONSOLE_PORT

TELNETCONSOLE_USERNAME

14. Exceptions

Scrapy provides several exception classes. Some descriptions are hard to translate well, so the original wording from the docs is kept.
(1) scrapy.exceptions.CloseSpider(reason='cancelled') can be raised from a spider callback to request that the spider be closed or stopped. The reason argument describes why.
Example:

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

(2) scrapy.exceptions.DontCloseSpider: This exception can be raised in a spider_idle signal handler to prevent the spider from being closed.
(3) scrapy.exceptions.DropItem: raise this from a pipeline while processing data to drop an item that does not meet your requirements.
(4) scrapy.exceptions.IgnoreRequest: This exception can be raised by the Scheduler or any downloader middleware to indicate that the request should be ignored.
(5) scrapy.exceptions.NotConfigured: This exception can be raised by some components to indicate that they will remain disabled. Those components include:

Extensions
Item pipelines
Downloader middlewares
Spider middlewares

The exception must be raised in the component’s __init__ method.
(6) scrapy.exceptions.NotSupported ,This exception is raised to indicate an unsupported feature.
(7) scrapy.exceptions.StopDownload(fail=True) ,Raised from a bytes_received or headers_received signal handler to indicate that no further bytes should be downloaded for a response.
The fail boolean parameter controls which method will handle the resulting response:

If fail=True (default), the request errback is called. The response object is available as the response attribute of the StopDownload exception, which is in turn stored as the value attribute of the received Failure object. This means that in an errback defined as def errback(self, failure), the response can be accessed through failure.value.response.

If fail=False, the request callback is called instead.

In both cases, the response could have its body truncated: the body contains all bytes received up until the exception is raised, including the bytes received in the signal handler that raises the exception. Also, the response object is marked with "download_stopped" in its Response.flags attribute.

Note: fail is a keyword-only parameter, i.e. raising StopDownload(False) or StopDownload(True) will raise a TypeError.

See the documentation for the bytes_received and headers_received signals and the Stopping the download of a Response topic for additional information and examples.

15. Logging

Scrapy's logging is derived from Python's built-in logging module, though only part of it is exposed (the old scrapy.log interface is deprecated).

The logging module has five log levels, from highest to lowest: logging.CRITICAL, logging.ERROR, logging.WARNING, logging.INFO and logging.DEBUG. Plain use of the logging module looks like this:

import logging
logging.warning("This is a warning")
logging.log(logging.WARNING, "This is a warning")

When using the logging module, the common practice is to give each module its own logger (its own handler), like this:

import logging
logger = logging.getLogger(__name__)
logger.warning("This is a warning")

In Scrapy, every spider object has its own logger, for example:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['https://scrapy.org']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)

This logger uses the spider's name. Logging is then configured in settings.py; the available options are:

LOG_FILE
LOG_FILE_APPEND
LOG_ENABLED
LOG_ENCODING
LOG_LEVEL
LOG_FORMAT
LOG_DATEFORMAT
LOG_STDOUT
LOG_SHORT_NAMES

For detailed explanations, see Python,爬虫与深度学习(11)——scrapy中settings的可选项; the rest is not covered here. A minimal sketch of a few of these options in settings.py follows.
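
A minimal settings.py sketch using a few of these options (the values are illustrative, not recommendations):

# log-related options in settings.py
LOG_ENABLED = True
LOG_LEVEL = 'INFO'                 # ignore DEBUG noise
LOG_FILE = 'crawl.log'             # write the log to a file instead of stderr
LOG_FILE_APPEND = False            # start a fresh file on every run (Scrapy >= 2.6)
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'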

16. Stats Collection

This collects the spider's run-time statistics. It matters more for distributed crawling, and ready-made distributed Scrapy stacks exist (scrapy-redis, scrapyd), so it is not covered here.

17. Sending e-mail

In short, Scrapy can send e-mail. Python's own smtplib is already more than enough for that, mail notifications matter mostly in distributed setups, and even distributed Scrapy is usually driven from a web UI, so to me the feature feels a bit superfluous (unless you want to check crawler e-mails while slacking off at work). Still, here is a quick look.

Example usage:

from scrapy.mail import MailSender
mailer = MailSender()
#built with the constructor
mailer = MailSender.from_settings(settings)
#built from the Scrapy settings with from_settings
mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body", cc=["another@example.com"])
#the method that actually sends the mail

The MailSender constructor is scrapy.mail.MailSender(smtphost=None, mailfrom=None, smtpuser=None, smtppass=None, smtpport=None), with these parameters:
(1) smtphost (str or bytes) – the SMTP host to use for sending the emails. If omitted, the MAIL_HOST setting will be used.
(2) mailfrom (str) – the address used to send emails (in the From: header). If omitted, the MAIL_FROM setting will be used.
(3) smtpuser – the SMTP user. If omitted, the MAIL_USER setting will be used. If not given, no SMTP authentication will be performed.
(4) smtppass (str or bytes) – the SMTP pass for authentication.
(5) smtpport (int) – the SMTP port to connect to
(6) smtptls (bool) – enforce using SMTP STARTTLS
(7) smtpssl (bool) – enforce using a secure SSL connection
If SMTP mail services are unfamiliar, the CommentNotifier plugin section of my article Docker树莓派实践——Typecho及其备份 is a gentle introduction.

18. Telnet Console

Scrapy provides a telnet console for connecting to and controlling a running Scrapy process; it is used more in distributed setups and is not covered here either.

19. FAQ

Some of the Frequently Asked Questions from the official documentation, copied below:

How does Scrapy compare to BeautifulSoup or lxml?
BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them.

Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them. After all, they’re just parsing libraries which can be imported and used from any Python code.

In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.

Can I use Scrapy with BeautifulSoup?
Yes, you can. As mentioned above, BeautifulSoup can be used for parsing HTML responses in Scrapy callbacks. You just have to feed the response’s body into a BeautifulSoup object and extract whatever data you need from it.

Here’s an example spider using BeautifulSoup API, with lxml as the HTML parser:

from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string
        }
Note

BeautifulSoup supports several HTML/XML parsers. See BeautifulSoup’s official documentation on which ones are available.

Did Scrapy “steal” X from Django?
Probably, but we don’t like that word. We think Django is a great open source project and an example to follow, so we’ve used it as an inspiration for Scrapy.

We believe that, if something is already done well, there’s no need to reinvent it. This concept, besides being one of the foundations for open source and free software, not only applies to software but also to documentation, procedures, policies, etc. So, instead of going through each problem ourselves, we choose to copy ideas from those projects that have already solved them properly, and focus on the real problems we need to solve.

We’d be proud if Scrapy serves as an inspiration for other projects. Feel free to steal from us!

Does Scrapy work with HTTP proxies?
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

How can I scrape an item with attributes in different pages?
See Passing additional data to callback functions.

Scrapy crashes with: ImportError: No module named win32api
You need to install pywin32 because of this Twisted bug.

How can I simulate a user login in my spider?
See Using FormRequest.from_response() to simulate a user login.

Does Scrapy crawl in breadth-first or depth-first order?
By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases.

If you do want to crawl in true BFO order, you can do it by setting the following settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
While pending requests are below the configured values of CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP, those requests are sent concurrently. As a result, the first few requests of a crawl rarely follow the desired order. Lowering those settings to 1 enforces the desired order, but it significantly slows down the crawl as a whole.

My Scrapy crawler has memory leaks. What can I do?
See Debugging memory leaks.

Also, Python has a builtin memory leak issue which is described in Leaks without leaks.

How can I make Scrapy consume less memory?
See previous question.

How can I prevent memory errors due to many allowed domains?
If you have a spider with a long list of allowed_domains (e.g. 50,000+), consider replacing the default OffsiteMiddleware spider middleware with a custom spider middleware that requires less memory. For example:

If your domain names are similar enough, use your own regular expression instead joining the strings in allowed_domains into a complex regular expression.

If you can meet the installation requirements, use pyre2 instead of Python’s re to compile your URL-filtering regular expression. See issue 1908.

See also other suggestions at StackOverflow.

Note

Remember to disable scrapy.spidermiddlewares.offsite.OffsiteMiddleware when you enable your custom implementation:

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    'myproject.middlewares.CustomOffsiteMiddleware': 500,
}
Can I use Basic HTTP Authentication in my spiders?
Yes, see HttpAuthMiddleware.

Why does Scrapy download pages in English instead of my native language?
Try changing the default Accept-Language request header by overriding the DEFAULT_REQUEST_HEADERS setting.

Where can I find some example Scrapy projects?
See Examples.

Can I run a spider without creating a project?
Yes. You can use the runspider command. For example, if you have a spider written in a my_spider.py file you can run it with:

scrapy runspider my_spider.py
See runspider command for more info.

I get “Filtered offsite request” messages. How can I fix them?
Those messages (logged with DEBUG level) don’t necessarily mean there is a problem, so you may not need to fix them.

Those messages are thrown by the Offsite Spider Middleware, which is a spider middleware (enabled by default) whose purpose is to filter out requests to domains outside the ones covered by the spider.

For more info see: OffsiteMiddleware.

What is the recommended way to deploy a Scrapy crawler in production?
See Deploying Spiders.

Can I use JSON for large exports?
It’ll depend on how large your output is. See this warning in JsonItemExporter documentation.

Can I return (Twisted) deferreds from signal handlers?
Some signals support returning deferreds from their handlers, others don’t. See the Built-in signals reference to know which ones.

What does the response status code 999 mean?
999 is a custom response status code used by Yahoo sites to throttle requests. Try slowing down the crawling speed by using a download delay of 2 (or higher) in your spider:

class MySpider(CrawlSpider):

    name = 'myspider'

    download_delay = 2

    # [ ... rest of the spider code ... ]
Or by setting a global download delay in your project with the DOWNLOAD_DELAY setting.

Can I call pdb.set_trace() from my spiders to debug them?
Yes, but you can also use the Scrapy shell which allows you to quickly analyze (and even modify) the response being processed by your spider, which is, quite often, more useful than plain old pdb.set_trace().

For more info see Invoking the shell from spiders to inspect responses.

Simplest way to dump all my scraped items into a JSON/CSV/XML file?
To dump into a JSON file:

scrapy crawl myspider -O items.json
To dump into a CSV file:

scrapy crawl myspider -O items.csv
To dump into a XML file:

scrapy crawl myspider -O items.xml
For more information see Feed exports

What’s this huge cryptic __VIEWSTATE parameter used in some forms?
The __VIEWSTATE parameter is used in sites built with ASP.NET/VB.NET. For more info on how it works see this page. Also, here’s an example spider which scrapes one of these sites.

What’s the best way to parse big XML/CSV data feeds?
Parsing big feeds with XPath selectors can be problematic since they need to build the DOM of the entire feed in memory, and this can be quite slow and consume a lot of memory.

In order to avoid parsing all the entire feed at once in memory, you can use the functions xmliter and csviter from scrapy.utils.iterators module. In fact, this is what the feed spiders (see Spiders) use under the cover.

Does Scrapy manage cookies automatically?
Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.

For more info see Requests and Responses and CookiesMiddleware.

How can I see the cookies being sent and received from Scrapy?
Enable the COOKIES_DEBUG setting.

How can I instruct a spider to stop itself?
Raise the CloseSpider exception from a callback. For more info see: CloseSpider.

How can I prevent my Scrapy bot from getting banned?
See Avoiding getting banned.

Should I use spider arguments or settings to configure my spider?
Both spider arguments and settings can be used to configure your spider. There is no strict rule that mandates to use one or the other, but settings are more suited for parameters that, once set, don’t change much, while spider arguments are meant to change more often, even on each spider run and sometimes are required for the spider to run at all (for example, to set the start url of a spider).

To illustrate with an example, assuming you have a spider that needs to log into a site to scrape data, and you only want to scrape data from a certain section of the site (which varies each time). In that case, the credentials to log in would be settings, while the url of the section to scrape would be a spider argument.

I’m scraping a XML document and my XPath selector doesn’t return any items
You may need to remove namespaces. See Removing namespaces.

How to split an item into multiple items in an item pipeline?
Item pipelines cannot yield multiple items per input item. Create a spider middleware instead, and use its process_spider_output() method for this purpose. For example:

from copy import deepcopy

from itemadapter import is_item, ItemAdapter

class MultiplyItemsMiddleware:

    def process_spider_output(self, response, result, spider):
        for item in result:
            if is_item(item):
                adapter = ItemAdapter(item)
                for _ in range(adapter['multiply_by']):
                    yield deepcopy(item)
Does Scrapy support IPv6 addresses?
Yes, by setting DNS_RESOLVER to scrapy.resolver.CachingHostnameResolver. Note that by doing so, you lose the ability to set a specific timeout for DNS requests (the value of the DNS_TIMEOUT setting is ignored).

How to deal with <class 'ValueError'>: filedescriptor out of range in select() exceptions?
This issue has been reported to appear when running broad crawls in macOS, where the default Twisted reactor is twisted.internet.selectreactor.SelectReactor. Switching to a different reactor is possible by using the TWISTED_REACTOR setting.

How can I cancel the download of a given response?
In some situations, it might be useful to stop the download of a certain response. For instance, sometimes you can determine whether or not you need the full contents of a response by inspecting its headers or the first bytes of its body. In that case, you could save resources by attaching a handler to the bytes_received or headers_received signals and raising a StopDownload exception. Please refer to the Stopping the download of a Response topic for additional information and examples.

Running runspider I get error: No spider found in file: <filename>
This may happen if your Scrapy project has a spider module with a name that conflicts with the name of one of the Python standard library modules, such as csv.py or os.py, or any Python package that you have installed. See issue 2680.

Tags: python, scrapy
