Python, Web Scraping and Deep Learning (9): A First Look at Scrapy
Steps / Contents:
1. Use cases and the Scrapy architecture
2. Installing Scrapy and creating a project
3. Using Scrapy to crawl http://www.ip3366.net/free/
(1) Creating the project and modifying the framework files
(2) Contents of the custom library
(3) Running the spider
This article was first published on my personal blog at https://lisper517.top/index.php/archives/47/ ; please credit the source when reposting.
This article explains how to use Scrapy.
It was written on September 10, 2022.
As the saying goes, the better you write crawlers, the sooner you end up in jail. When running a crawler, do not consume too much of the target server's bandwidth, or you may get yourself into trouble.
My abilities are limited and I am not a CS professional, so there are bound to be oversights or places where the code does not follow good practice; please point them out in the comments.
1. Use cases and the Scrapy architecture
In the earlier article Python, Web Scraping and Deep Learning (6): Crawlers in Detail (2), Building Your Own IP Pool (Simple Version), I mentioned https://free.kuaidaili.com/free/ , a site that lists more than 4,700 pages of free proxy IPs. Crawling all of those IPs takes a while, but it is fairly simple, because the URLs of the pages to crawl differ only in the page number at the end.
Now suppose I want to crawl an entire website: start from its root page, scrape the data I need from it, extract every qualifying hyperlink on it (only links that belong to this site; for example, pages with an ICP filing notice at the bottom carry a link to the Ministry of Industry and Information Technology's filing site, and links like that should be skipped), visit each of those links in turn, and repeat the two steps of scraping data and extracting links on every page. Kept up long enough, this visits every link that ever appears on the site's pages (links that never appear on any page are hidden and out of reach), which is what crawling a whole site usually means.
Be aware, though, that some sites deliberately plant links that a normal visitor can never see (links that do not show up in a browser, for example placed outside the visible area, or empty a tags). A crawler reading the page source finds them easily, but as soon as it visits such a page the server knows it is a bot and stops serving it; this is another anti-crawling technique. On our side, the best defence is to keep the link-extraction rules narrow and avoid grabbing every link on a page with a broad regular expression, as the sketch below illustrates.
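As an illustration of narrowing the rules, Scrapy's CrawlSpider and LinkExtractor let you declare exactly which links may be followed. The sketch below is only a hypothetical example of the idea and is not part of this article's project; the spider name, domain and allow pattern are placeholders.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WholeSiteSpider(CrawlSpider):
    name = 'whole_site_demo'               # hypothetical spider name
    allowed_domains = ['example.com']      # never leave this domain
    start_urls = ['https://example.com/']
    rules = (
        # follow only article-like links instead of every <a> tag on the page
        Rule(LinkExtractor(allow=r'/articles/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}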
Scrapy is a framework designed for exactly this kind of site-wide crawling. See the runoob tutorial for an introduction to its components; here is a brief summary. Crawling a site with Scrapy works like this: starting from an initial URL, the Downloader fetches the response for that URL and hands it to the Spider, which parses it and extracts both the data and any further URLs to visit. The data goes to the Item Pipeline for storage, for example into a txt file or a database; the other URLs are turned into requests in the Spider and sent to the Scheduler, which hands them back to the Downloader at a suitable time. There are also downloader middlewares and spider middlewares for custom behaviour; for instance, if you want to fetch pages with selenium, you would override the response-fetching step in a middleware. At the centre of these four components sits the Scrapy engine, the core component that coordinates them and passes messages between them.
Finally, Scrapy fetches pages asynchronously and with high concurrency (it is built on Twisted rather than on multiple processes), so it crawls very fast, but that also makes it easier to get your IP banned. Before using Scrapy it is best to already have some proxy IPs on hand, and to throttle the crawl with settings like those shown below.
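The settings below are a conservative example of such throttling; they all appear, commented out, in the generated settings.py shown later, and the exact values here are only an illustration:
CONCURRENT_REQUESTS = 8      # fewer parallel requests than the default 16
DOWNLOAD_DELAY = 1           # wait about 1 second between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to the server's latency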
2. Installing Scrapy and creating a project
Install Scrapy with pip install Scrapy
. Then, in cmd or a terminal, go to the folder where you keep your Scrapy projects (you can dedicate this folder to crawlers from now on) and create a project:
scrapy startproject project_name
Then enter the project folder and create a spider:
scrapy genspider spider_name "url_to_crawl"
The URL can be anything for now; it can be changed in the generated file. Also, a single project can contain multiple spiders.
After modifying the files in the project, run the spider with scrapy crawl spider_name
. The rest of this article shows how to modify these files.
3. Using Scrapy to crawl http://www.ip3366.net/free/
Using http://www.ip3366.net/free/ as an example, this section demonstrates how to use Scrapy to collect the proxy IPs on that site. The commands below are run on a Windows 10 machine.
(1) Creating the project and modifying the framework files
Create a new folder, for example D:\spiders\scrapy
, then run the following:
scrapy startproject ipool
cd ipool
scrapy genspider ip3366 http://www.ip3366.net/free/
During creation you will notice that Scrapy uses project templates and spider templates; here we use the default ones. If you later have specific requirements for your spiders, you can write your own templates.
Then open D:\spiders\scrapy\ipool\ipool in File Explorer. You will see items.py, middlewares.py, pipelines.py and settings.py, plus ip3366.py inside the spiders subfolder; these are the five files we will modify next.
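For reference, the layout generated by the two commands above looks like this (comments added):
D:\spiders\scrapy\ipool
    scrapy.cfg            # project configuration used by the scrapy command
    ipool
        __init__.py
        items.py          # defines the fields to be scraped
        middlewares.py    # spider middleware and downloader middleware
        pipelines.py      # stores / processes the scraped items
        settings.py       # project-wide settings
        spiders
            __init__.py
            ip3366.py     # the spider created by scrapy genspider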
The original content of items.py is:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class IpoolItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
Modifying this file is simple: it defines the fields to be scraped. Change items.py to:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class IpoolItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
ips_dict_list = scrapy.Field()
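A scrapy.Field() declaration simply registers a key; the item itself is then used like a dictionary. A quick sketch of how the spider will fill it later (the values here are made up):
from ipool.items import IpoolItem

item = IpoolItem()
item['ips_dict_list'] = [{'ip': '1.2.3.4', 'port': 8080}]  # made-up example data
print(dict(item))  # a scrapy.Item can be converted to a plain dict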
Next comes middlewares.py, the file that holds the two middleware classes, where custom behaviour can be configured; in this example we only add a User-Agent. The original file is:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class IpoolSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class IpoolDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
Set the User-Agent in the downloader middleware; I have also added a few comments:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
# Spider middleware
class IpoolSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
import random
# Downloader middleware
class IpoolDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
UA_list = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0'
]
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
    # Intercepts every normal (non-exception) request; a good place for User-Agent spoofing
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
request.headers['User-Agent'] = random.choice(self.UA_list)
return None
    # Intercepts every response; useful when scraping dynamically loaded data
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
    # Intercepts requests that raised an exception; a good place to set a proxy IP
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
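As the comment above process_exception says, that hook is also the natural place to attach a proxy to requests that failed. Proxies are not used in this project's middleware, but a minimal sketch of such a variant (with a made-up proxy list, relying on the random import already added above) would look like this:
    # Hypothetical variant of process_exception: retry failed requests through a proxy
    def process_exception(self, request, exception, spider):
        proxy_list = ['http://1.2.3.4:8080']               # made-up proxies; use your own pool
        request.meta['proxy'] = random.choice(proxy_list)  # picked up by Scrapy's HttpProxyMiddleware
        return request  # returning the request sends it back to be downloaded again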
Next is pipelines.py, where the scraped data is processed. A project can define several pipeline classes; their priorities are set later in settings.py. Also make sure every pipeline's process_item ends with return item, so that once a pipeline has finished processing the data passed in from the spider (the item), the item is handed on to the pipeline with the next priority. The original file is:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
class IpoolPipeline:
def process_item(self, item, spider):
return item
Rewrite it as:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from MyPythonLib.ipool_mysql_api import mysql_api
import os
class IpoolPipeline:
    # This hook is called only once, when the spider starts
def open_spider(self, spider):
self.fp = open(os.path.join(os.getcwd(), 'ips.txt'),
'w',
encoding='utf-8')
    # This method processes the scraped data
def process_item(self, item, spider):
for ip_dict in item['ips_dict_list']:
ip = ip_dict['ip']
port = ip_dict['port']
self.fp.write(ip + ':' + str(port) + '\n')
return item
    # This hook is called only once, when the spider closes
def close_spider(self, spider):
self.fp.close()
class MysqlPipeline:
def open_spider(self, spider):
self.mysql_api = mysql_api()
def process_item(self, item, spider):
self.mysql_api.save_ips(item['ips_dict_list'])
return item
def close_spider(self, spider):
self.mysql_api.check_repeat()
Two pipeline classes are defined here: one saves the crawled ip and port pairs to a txt file, the other writes the data into MySQL. MyPythonLib is my own library; its contents are shown in the next section.
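As a side note, for simple cases you do not need a custom pipeline at all: Scrapy's built-in feed export can dump every yielded item straight to a file from the command line, for example:
scrapy crawl ip3366 -o ips.json
Here a MySQL pipeline is used instead, because the collected IPs need to be deduplicated and queried again later.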
Next is settings.py; its original content is:
# Scrapy settings for ipool project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ipool'
SPIDER_MODULES = ['ipool.spiders']
NEWSPIDER_MODULE = 'ipool.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ipool (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ipool.middlewares.IpoolSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'ipool.middlewares.IpoolDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'ipool.pipelines.IpoolPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Add and change a few lines in it:
# Scrapy settings for ipool project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
LOG_LEVEL = 'ERROR'  # only print error messages
BOT_NAME = 'ipool'
SPIDER_MODULES = ['ipool.spiders']
NEWSPIDER_MODULE = 'ipool.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ipool (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # do not obey robots.txt
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ipool.middlewares.IpoolSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'ipool.middlewares.IpoolDownloaderMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ipool.pipelines.IpoolPipeline': 2,  # items flow through the pipelines in ascending order of these numbers
'ipool.pipelines.MysqlPipeline': 1
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Finally comes ip3366.py, the file that parses the crawled content. The original file is:
import scrapy
class Ip3366Spider(scrapy.Spider):
name = 'ip3366'
allowed_domains = ['www.ip3366.net']
start_urls = ['http://www.ip3366.net/']
def parse(self, response):
pass
Modify it to:
import scrapy
from ipool.items import IpoolItem
class Ip3366Spider(scrapy.Spider):
name = 'ip3366'
    #allowed_domains = ['www.ip3366.net']  # allowed domains; URLs outside them will not be crawled
start_urls = [
'http://www.ip3366.net/free/', 'http://www.ip3366.net/free/?stype=2'
    ]  # list of start URLs
base_url = 'http://www.ip3366.net/free/'
def parse(self, response):
ip_nodes = response.xpath(r'//*[@id="list"]/table/tbody/tr')
        ips_dict_list = []  # a list of dicts, matching the format expected by mysql_api.save_ips
for ip_node in ip_nodes:
ip_dict = {}
ip = ip_node.xpath(r'./td[1]/text()').extract()[0].strip()
port = int(ip_node.xpath(r'./td[2]/text()').extract()[0].strip())
anonymous = (
0, 1)['高匿' in ip_node.xpath(r'./td[3]/text()').extract()[0]]
            # (a, b)[condition] is another way to write a ternary expression in Python; note that when the condition is true, the whole expression evaluates to b
type_ = (
0, 1)['HTTPS' in ip_node.xpath(r'./td[4]/text()').extract()[0]]
physical_address = ip_node.xpath(
r'./td[5]/text()').extract()[0].strip()
response_time = 1000 * int(
ip_node.xpath(r'./td[6]/text()').extract()[0].strip().replace(
'秒', ''))
response_time = (response_time, 250)[response_time == 0]
            # many proxies on this site report a latency of 0, which is not accurate, so set them all to 250 ms, i.e. 0.25 s
last_verify = ip_node.xpath(
r'./td[7]/text()').extract()[0].strip().replace('/', '-')
ip_dict['ip'] = ip
ip_dict['port'] = port
ip_dict['anonymous'] = anonymous
ip_dict['type'] = type_
ip_dict['physical_address'] = physical_address
ip_dict['response_time'] = response_time
ip_dict['last_verify'] = last_verify
ips_dict_list.append(ip_dict)
item = IpoolItem()
item['ips_dict_list'] = ips_dict_list
yield item
next_url = response.xpath(
r'//*[@id="listnav"]/ul/a[text()="下一页"]/@href').extract()
if next_url != None and len(next_url) != 0:
next_url = Ip3366Spider.base_url + next_url[0]
yield scrapy.Request(url=next_url,
                                 callback=self.parse)  # the callback is this method itself, i.e. pages are parsed recursively
    # This method is called once when the spider closes
    def closed(self, reason):
        pass
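When writing XPath expressions like the ones above, it is convenient to test them in the Scrapy shell first and only paste them into parse() once they work. For example:
scrapy shell "http://www.ip3366.net/free/"
>>> response.xpath(r'//*[@id="list"]/table/tbody/tr/td[1]/text()').extract()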
(2) Contents of the custom library
See Python, Web Scraping and Deep Learning (7): Building Your Own Library for how to create your own library. The ipool_mysql_api.py file I use contains:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import os, logging, time
from MyPythonLib import mysql
from MyPythonLib import log as mylog
if not os.path.isdir(os.path.join(os.getcwd(), 'logs')):
os.mkdir(os.path.join(os.getcwd(), 'logs'))
class mysql_api():
"""通过这个接口实现提交爬取的ip、取出ip的操作。"""
conn = mysql.get_conn()
    cur = conn.cursor()  # this cursor is shared by the whole class, for work that does not need an instance
def __init__(self, log: logging.Logger = None):
if log == None:
log = mylog.get_logger(
os.path.join(os.getcwd(), 'logs'),
'mysql_api_log#{}.txt'.format(
time.strftime('-%Y-%m-%d_%H_%M_%S')), 'mysql_api')
self.log = log
message = 'one of mysql_api object initialized'
self.log.info(message)
self.cur = mysql_api.conn.cursor()
mysql_api.check_table()
message = 'database ipool and table ipool checked'
self.log.info(message)
def __del__(self):
self.cur.close()
message = 'one of mysql_api object deleted'
self.log.info(message)
@staticmethod
def check_table():
"""检查一下mysql中是否已经建好了表,若表不存在则新建表。
这里并没有检查:若表存在,表的格式是否正确。
:数据库的各列含义与默认值:
:no,自增主键;
:ip,代理ip("0.0.0.0");
:port,代理ip端口(0);
:anonymous,是否为高匿,0为否,1为是,3为其他(3);
:type,0为HTTP,1为HTTPS,3为其他(3);
:physical_address,代理ip的地理位置("unknown");
:response_time,响应时间,单位为ms(-1);
:last_verify,上次验证该代理ip的时间("1000-01-01 00:00:00");
:created,该项目创建时间。"""
cur, conn = mysql_api.cur, mysql_api.conn
command = '''
CREATE TABLE IF NOT EXISTS ipool (
no BIGINT UNSIGNED AUTO_INCREMENT,
ip VARCHAR(50) NOT NULL DEFAULT "0.0.0.0" COMMENT "IP address",
port SMALLINT UNSIGNED NOT NULL DEFAULT 0 COMMENT "port",
anonymous BIT(2) NOT NULL DEFAULT 3 COMMENT "whether ip is anonymous",
type BIT(2) NOT NULL DEFAULT 3 COMMENT "HTTP or HTTPS or both or others",
physical_address VARCHAR(100) NOT NULL DEFAULT "unknown" COMMENT "where is the server",
            response_time MEDIUMINT NOT NULL DEFAULT -1 COMMENT "response_time in milliseconds",
last_verify DATETIME NOT NULL DEFAULT '1000-01-01 00:00:00' COMMENT "last verify time",
created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (no));'''
        command = ' '.join(command.split())  # collapse newlines and indentation into single spaces
try:
cur.execute("CREATE DATABASE IF NOT EXISTS ipool;")
cur.execute("USE ipool;")
cur.execute(command)
conn.commit()
except:
conn.rollback()
raise Exception('mysql_api.check_table failed')
def save_ips(self, ips_list: list):
"""存入多个ip(以字典的列表形式提交)。注意这里只是对不存在的数据添加默认值,没有验证数据类型等。
:数据库的各列含义与默认值:
:no,自增主键;
:ip,代理ip("0.0.0.0");
:port,代理ip端口(0);
:anonymous,是否为高匿,0为否,1为是,3为其他(3);
:type,0为HTTP,1为HTTPS,3为其他(3);
:physical_address,代理ip的地理位置("unknown");
:response_time,响应时间,单位为ms(-1);
:last_verify,上次验证该代理ip的时间("1000-01-01 00:00:00");
:created,该项目创建时间。"""
cur, conn, log = self.cur, mysql_api.conn, self.log
command = '''INSERT INTO ipool (ip,port,anonymous,type,physical_address,response_time,last_verify) VALUES '''
if len(ips_list) == 0:
message = 'no ip from this website'
log.info(message)
return
for ip_dict in ips_list:
ip, port, anonymous, type_, physical_address, response_time, last_verify = "0.0.0.0", 0, 3, 3, 'unknown', -1, "1000-01-01 00:00:00"
if 'ip' in ip_dict.keys():
ip = ip_dict['ip']
if 'port' in ip_dict.keys():
port = ip_dict['port']
if 'anonymous' in ip_dict.keys():
anonymous = ip_dict['anonymous']
if 'type' in ip_dict.keys():
type_ = ip_dict['type']
if 'physical_address' in ip_dict.keys():
physical_address = ip_dict['physical_address']
if 'response_time' in ip_dict.keys():
response_time = ip_dict['response_time']
if 'last_verify' in ip_dict.keys():
last_verify = ip_dict['last_verify']
command += '("{}",{},{},{},"{}",{},"{}"),'.format(
ip, port, anonymous, type_, physical_address, response_time,
last_verify)
message = 'trying to save ip:port={}:{}'.format(ip, port)
log.info(message)
command = command[:-1] + ';'
try:
cur.execute(command)
conn.commit()
message = 'saving ips successfully'
log.info(message)
except:
conn.rollback()
            log.info('failed SQL command: ' + command)
raise Exception('mysql_api.save_ips failed')
def check_repeat(self):
"""检查爬到的数据里有没有重复的,只检查ip和port都相同的。如果有重复的,保留最新的一条记录,也就是no最大的。"""
cur, conn, log = mysql_api.cur, mysql_api.conn, self.log
cur.execute('USE ipool;')
cur.execute(
'SELECT ip,port FROM ipool GROUP BY ip,port HAVING COUNT(*)>1;')
repeated_items = cur.fetchall()
if len(repeated_items) == 0:
message = 'no repeated ip:port in ipool'
log.info(message)
return
else:
message = 'found repeated ip:port {} kinds'.format(
len(repeated_items))
log.info(message)
command = '''
DELETE FROM ipool WHERE no IN (SELECT no FROM
(SELECT no FROM ipool WHERE
(ip,port) IN (SELECT ip,port FROM ipool GROUP BY ip,port HAVING COUNT(*)>1)
AND no NOT IN (SELECT MAX(no) FROM ipool GROUP BY ip,port HAVING COUNT(*)>1)) AS a);'''
        command = ' '.join(command.split())  # collapse newlines and indentation into single spaces
try:
cur.execute(command)
conn.commit()
message = 'repeated ip:port deleted successfully. '
log.info(message)
except:
conn.rollback()
raise Exception('mysql_api.check_repeat failed')
@staticmethod
def get_ips(random: bool = False,
anonymous: list = [0, 1, 2, 3],
type_: list = [0, 1, 2, 3],
response_time: int = 30000,
last_verify_interval: str = '48:00:00',
limit: int = 1000) -> tuple:
"""根据要求返回多个ip,只返回ip和port。
:param random: 是否随机返回ip
:param anonymous: 以列表形式指定匿名性
:param type_: 对HTTP还是HTTPS或其他代理
:param response_time: 响应时间在多少ms之内
:param last_verify_interval: 用于指定上次验证的时间距现在不超过多久
:param limit: 最多返回几条代理ip
:return tuple('ip:port','ip:port', ...)"""
cur = mysql_api.cur
cur.execute('USE ipool;')
anonymous, type_ = tuple(anonymous), tuple(type_)
command = '''
SELECT ip,port FROM ipool
WHERE anonymous IN {} AND type IN {}
AND response_time BETWEEN 0 AND {} AND (NOW()-last_verify)<"{}"
ORDER BY response_time,last_verify DESC LIMIT {};
'''.format(anonymous, type_, response_time, last_verify_interval,
limit)
if random:
command = '''
SELECT ip,port FROM (SELECT ip,port FROM ipool
WHERE anonymous IN {} AND type IN {}
AND response_time BETWEEN 0 AND {} AND (NOW()-last_verify)<"{}"
            ) AS t ORDER BY RAND() LIMIT {};
'''.format(anonymous, type_, response_time, last_verify_interval,
limit)
        command = ' '.join(command.split())  # collapse newlines and indentation into single spaces
try:
cur.execute(command)
except:
raise Exception('mysql_api.get_ips failed')
ips = cur.fetchall()
ips = tuple(
[ips[i][0] + ':' + str(ips[i][1]) for i in range(len(ips))])
return ips
For the contents of the two custom libraries mysql and log, see Python, Web Scraping and Deep Learning (7): Building Your Own Library.
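Once the crawl has filled the ipool table, proxies can be pulled back out through the same interface. A brief usage sketch (the filter values below are only an example):
from MyPythonLib.ipool_mysql_api import mysql_api

# HTTPS (1) or other (3) proxies verified within the last 24 hours, fastest first, at most 50
proxies = mysql_api.get_ips(type_=[1, 3], response_time=5000,
                            last_verify_interval='24:00:00', limit=50)
print(proxies)  # ('ip:port', 'ip:port', ...)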
(3) Running the spider
Open cmd, change into D:\spiders\scrapy\ipool\ipool and run scrapy crawl ip3366
to start the spider.
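The spider can also be started from a short Python script instead of the command line, which is handy if you later want to schedule it with Task Scheduler or cron. A sketch, to be run from the project root D:\spiders\scrapy\ipool :
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from ipool.spiders.ip3366 import Ip3366Spider

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl(Ip3366Spider)
process.start()  # blocks until the crawl is finished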