Steps / contents:
1. Use cases and the Scrapy architecture
2. Installing Scrapy and creating a project
3. Using Scrapy to crawl http://www.ip3366.net/free/
    (1) Creating the project and modifying the framework files
    (2) Contents of the self-built library
    (3) Running the spider

This article was first published on my personal blog at https://lisper517.top/index.php/archives/47/; please credit the source when reposting.
The purpose of this article is to explain how to use Scrapy.
It was written on 10 September 2022.

As the saying goes, write your crawler too well and you will land in jail early. When running a crawler, do not take up too much of the target server's bandwidth, or you may bring trouble on yourself.

My abilities are limited and I am not a CS professional, so some places are bound to contain mistakes or violate coding conventions; please point them out in the comments.

1. Use cases and the Scrapy architecture

In the earlier article Python, Crawlers and Deep Learning (6) — Crawlers in Detail (2): Building Your Own IP Pool (simple version), I mentioned https://free.kuaidaili.com/free/, a site that lists more than 4,700 pages of free proxy IPs. Crawling all of them takes a while but is fairly simple, because the URLs of the pages only differ in the trailing page number.
Now suppose I come across a website and want to crawl the whole site. That means starting from the root page, scraping the data I need from it, then extracting every qualifying hyperlink on that page (only links belonging to this site; for example, pages with an ICP filing notice at the bottom link to the MIIT filing site, and links like that should be dropped), visiting those links one by one, and repeating the same two steps, scraping data and extracting links, on each of them. Keep going and you eventually reach every link that appears on any page of the site (some links may never appear on any page; nothing can be done about such hidden ones). This is what crawling a whole site usually means.

Be careful, though: some pages deliberately plant links that a normal visitor can never see (links that cannot be displayed in the browser, for example placed outside the visible area, or empty a tags). A crawler reading the page source finds them easily, but as soon as it visits such a page the server knows it is dealing with a bot and stops serving it; this is another anti-crawling technique. On the crawler side, the best we can do is keep the link-extraction rules narrow and avoid blindly pulling every link on a page with a regex.

Scrapy is a crawler framework designed for exactly this kind of full-site crawling. For an introduction to its components see the runoob tutorial; here is a quick summary. Scrapy crawls a site starting from an initial URL: the Downloader fetches the response for that URL and hands it to the Spider, which parses it and extracts both the data and any further URLs to visit. The data is passed to the Item Pipeline for storage, for example written to a txt file or saved to a database; the other URLs are wrapped as requests in the Spider and sent to the Scheduler, which hands them back to the Downloader at the appropriate time. There are also downloader middlewares and spider middlewares for custom behaviour; for instance, if you want to fetch a response with selenium, you override the response-fetching part in a middleware. At the centre of these four components sits the Scrapy engine, the core component that coordinates them and passes messages between them.
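
To make this data flow concrete, here is a minimal, self-contained spider sketch (the class name, URL and XPath expressions are placeholders for illustration, not part of the project built later): whatever parse yields as an item goes to the item pipelines, and whatever it yields as a Request goes back to the scheduler.

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com/']  # the engine turns these into the first requests

    def parse(self, response):
        # the downloader has fetched `response`; the spider parses it here
        for title in response.xpath('//h2/text()').getall():
            yield {'title': title}  # yielded items are routed by the engine to the item pipelines
        for href in response.xpath('//a/@href').getall():
            # yielded requests go to the scheduler and, later, back to the downloader
            yield scrapy.Request(response.urljoin(href), callback=self.parse)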

Finally, Scrapy fetches pages concurrently (it is built on the asynchronous Twisted framework), so it crawls very fast, but that also makes it easier to get your IP banned. Before using Scrapy it is best to already have some proxy IPs on hand.

2. Installing Scrapy and creating a project

Install Scrapy with pip install Scrapy. Then, in cmd or a terminal, go into the folder where you keep your Scrapy projects (you can dedicate this folder to crawlers from now on) and create a project:

scrapy startproject <project name>

Then enter the project folder and create a spider:

scrapy genspider <spider name> "<URL to crawl>"

The URL can be anything for now; it can be changed later in the generated file. A single project may also contain several spiders.

After modifying the various files in the project, run the spider with scrapy crawl <spider name>. The rest of this article shows how to modify these files.

3. Using Scrapy to crawl http://www.ip3366.net/free/

This section uses http://www.ip3366.net/free/ as an example of how to use Scrapy; the task is to scrape the proxy IPs listed on that site. The commands below are run on a Windows 10 machine.

(1) Creating the project and modifying the framework files

Create a new folder, for example D:\spiders\scrapy, then run the following there:

scrapy startproject ipool
cd ipool
scrapy genspider ip3366 http://www.ip3366.net/free/
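
At this point the project layout should look roughly as follows (this is the standard layout produced by the default Scrapy templates; only ip3366.py under spiders is specific to this project):

ipool/
    scrapy.cfg
    ipool/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ip3366.py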

During creation you can see that Scrapy offers project templates and spider templates; the default templates are used here. If you later have specific requirements for your spiders, you can write your own templates.
Then open D:\spiders\scrapy\ipool\ipool in File Explorer. You will see items.py, middlewares.py, pipelines.py and settings.py, plus ip3366.py inside the spiders subfolder; these are the five files to modify next.

The original content of items.py is as follows:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class IpoolItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

Modifying this file is simple: it defines the fields to be scraped. Change items.py to the following:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class IpoolItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ips_dict_list = scrapy.Field()
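
Each scrapy.Field() declares one field, and the item is then used like a dict that only accepts the declared fields. A quick illustration (not part of the project files; the values are made up):

from ipool.items import IpoolItem

item = IpoolItem()
item['ips_dict_list'] = [{'ip': '1.2.3.4', 'port': 8080}]  # fine: the field is declared
# item['foo'] = 1  # would raise KeyError, because 'foo' is not a declared field
print(item['ips_dict_list'])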

Next is middlewares.py, the configuration file for the two middlewares, where custom behaviour can be added; in this example we only add a User-Agent. The original file is as follows:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class IpoolSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class IpoolDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Set the User-Agent in the downloader middleware; I have also added a few comments:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


# Spider middleware
class IpoolSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


import random


# Downloader middleware
class IpoolDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    UA_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0'
    ]

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # intercepts every request with no exception; a good place for User-Agent spoofing
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        request.headers['User-Agent'] = random.choice(self.UA_list)
        return None

    # intercepts every response; useful when scraping dynamically loaded data
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # intercepts requests that raised an exception; a good place to set a proxy IP
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
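
Incidentally, as the comment on process_exception suggests, that hook is also the usual place to switch to a proxy IP when a request fails. A minimal sketch of what it could look like (the proxy address below is a placeholder; in practice the addresses could come from the mysql_api.get_ips interface shown later):

    # hypothetical variant of IpoolDownloaderMiddleware.process_exception:
    # retry a failed request through a proxy (the address below is a placeholder)
    PROXY_LIST = ['http://127.0.0.1:8888']

    def process_exception(self, request, exception, spider):
        request.meta['proxy'] = random.choice(self.PROXY_LIST)
        return request  # returning a Request tells Scrapy to reschedule it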

Next comes pipelines.py, where the scraped data is processed. There can be several pipeline classes, and their priority is set later in settings.py. Also make sure that every pipeline ends process_item with return item, so that once a pipeline has finished processing the data handed over by the spider (the item), it can pass it on to the pipeline with the next priority. The original file is:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class IpoolPipeline:
    def process_item(self, item, spider):
        return item

Rewrite it as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from MyPythonLib.ipool_mysql_api import mysql_api
import os


class IpoolPipeline:

    # this hook is called only once, when the spider starts
    def open_spider(self, spider):
        self.fp = open(os.path.join(os.getcwd(), 'ips.txt'),
                       'w',
                       encoding='utf-8')

    # this method processes each item
    def process_item(self, item, spider):
        for ip_dict in item['ips_dict_list']:
            ip = ip_dict['ip']
            port = ip_dict['port']
            self.fp.write(ip + ':' + str(port) + '\n')
        return item

    # this hook is called only once, when the spider closes
    def close_spider(self, spider):
        self.fp.close()


class MysqlPipeline:

    def open_spider(self, spider):
        self.mysql_api = mysql_api()

    def process_item(self, item, spider):
        self.mysql_api.save_ips(item['ips_dict_list'])
        return item

    def close_spider(self, spider):
        self.mysql_api.check_repeat()

Two pipeline classes are defined here: one writes the scraped ip and port to a txt file, the other stores the data in MySQL. MyPythonLib is my own self-built library; its contents are shown in the next section.

Next is settings.py, whose original content is as follows:

# Scrapy settings for ipool project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ipool'

SPIDER_MODULES = ['ipool.spiders']
NEWSPIDER_MODULE = 'ipool.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ipool (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ipool.middlewares.IpoolSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ipool.middlewares.IpoolDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'ipool.pipelines.IpoolPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Add and change a few lines here:

# Scrapy settings for ipool project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

LOG_LEVEL = 'ERROR'  # only log error messages

BOT_NAME = 'ipool'

SPIDER_MODULES = ['ipool.spiders']
NEWSPIDER_MODULE = 'ipool.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ipool (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # do not obey robots.txt

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ipool.middlewares.IpoolSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'ipool.middlewares.IpoolDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ipool.pipelines.IpoolPipeline': 2,  # items flow through the pipelines in ascending order of these numbers
    'ipool.pipelines.MysqlPipeline': 1
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Finally there is ip3366.py, which is mainly responsible for parsing the crawled pages. The original file is:

import scrapy


class Ip3366Spider(scrapy.Spider):
    name = 'ip3366'
    allowed_domains = ['www.ip3366.net']
    start_urls = ['http://www.ip3366.net/']

    def parse(self, response):
        pass

Change it to the following:

import scrapy
from ipool.items import IpoolItem


class Ip3366Spider(scrapy.Spider):
    name = 'ip3366'
    #allowed_domains = ['www.ip3366.net']  # allowed domains; URLs outside them would not be crawled
    start_urls = [
        'http://www.ip3366.net/free/', 'http://www.ip3366.net/free/?stype=2'
    ]  # list of start URLs
    base_url = 'http://www.ip3366.net/free/'

    def parse(self, response):
        ip_nodes = response.xpath(r'//*[@id="list"]/table/tbody/tr')
        ips_dict_list = []  # a list of dicts, matching the format expected by mysql_api.save_ips
        for ip_node in ip_nodes:
            ip_dict = {}
            ip = ip_node.xpath(r'./td[1]/text()').extract()[0].strip()
            port = int(ip_node.xpath(r'./td[2]/text()').extract()[0].strip())
            anonymous = (
                0, 1)['高匿' in ip_node.xpath(r'./td[3]/text()').extract()[0]]
            # (a, b)[condition] is another way to write a Python ternary expression;
            # when the condition is true, the whole expression evaluates to b
            type_ = (
                0, 1)['HTTPS' in ip_node.xpath(r'./td[4]/text()').extract()[0]]
            physical_address = ip_node.xpath(
                r'./td[5]/text()').extract()[0].strip()
            response_time = 1000 * int(
                ip_node.xpath(r'./td[6]/text()').extract()[0].strip().replace(
                    '秒', ''))
            response_time = (response_time, 250)[response_time == 0]
            # many proxies on this site report a latency of 0, which is not reliable,
            # so they are all set to 250 ms (0.25 s)
            last_verify = ip_node.xpath(
                r'./td[7]/text()').extract()[0].strip().replace('/', '-')

            ip_dict['ip'] = ip
            ip_dict['port'] = port
            ip_dict['anonymous'] = anonymous
            ip_dict['type'] = type_
            ip_dict['physical_address'] = physical_address
            ip_dict['response_time'] = response_time
            ip_dict['last_verify'] = last_verify
            ips_dict_list.append(ip_dict)
        item = IpoolItem()
        item['ips_dict_list'] = ips_dict_list
        yield item

        next_url = response.xpath(
            r'//*[@id="listnav"]/ul/a[text()="下一页"]/@href').extract()
        if next_url:
            next_url = Ip3366Spider.base_url + next_url[0]
            yield scrapy.Request(url=next_url,
                                 callback=self.parse)  # the callback is parse itself, i.e. pages are parsed recursively

    # this method is called when the spider closes
    def closed(self, reason):
        pass
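
A quick note on the (a, b)[condition] idiom used in parse above: bool is a subclass of int in Python, so the condition simply indexes the tuple, False picking the first element and True the second. For example:

print((0, 1)['高匿' in '高匿代理IP'])     # prints 1, because True is used as index 1
print((0, 1)['高匿' in '普通代理IP'])     # prints 0, because False is used as index 0
print(1 if '高匿' in '高匿代理IP' else 0)  # the more common, equivalent spelling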

(2) Contents of the self-built library

See Python, Crawlers and Deep Learning (7) — Building Your Own Library for how to set up a self-built library. My ipool_mysql_api.py contains:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import os, logging, time
from MyPythonLib import mysql
from MyPythonLib import log as mylog

if not os.path.isdir(os.path.join(os.getcwd(), 'logs')):
    os.mkdir(os.path.join(os.getcwd(), 'logs'))


class mysql_api():
    """An interface for saving scraped proxy IPs and reading them back out."""

    conn = mysql.get_conn()
    cur = conn.cursor()  # this cursor is shared by the whole class, for work that does not need an instance

    def __init__(self, log: logging.Logger = None):
        if log == None:
            log = mylog.get_logger(
                os.path.join(os.getcwd(), 'logs'),
                'mysql_api_log#{}.txt'.format(
                    time.strftime('-%Y-%m-%d_%H_%M_%S')), 'mysql_api')
        self.log = log
        message = 'one of mysql_api object initialized'
        self.log.info(message)
        self.cur = mysql_api.conn.cursor()
        mysql_api.check_table()
        message = 'database ipool and table ipool checked'
        self.log.info(message)

    def __del__(self):
        self.cur.close()
        message = 'one of mysql_api object deleted'
        self.log.info(message)

    @staticmethod
    def check_table():
        """Check whether the table already exists in MySQL; create it if it does not.

        Note: if the table already exists, its schema is not validated.
        :Column meanings and default values:
        :no, auto-increment primary key;
        :ip, proxy IP ("0.0.0.0");
        :port, proxy port (0);
        :anonymous, whether the proxy is high-anonymity, 0 = no, 1 = yes, 3 = other (3);
        :type, 0 = HTTP, 1 = HTTPS, 3 = other (3);
        :physical_address, geographic location of the proxy ("unknown");
        :response_time, response time in ms (-1);
        :last_verify, when the proxy was last verified ("1000-01-01 00:00:00");
        :created, when this row was created."""
        cur, conn = mysql_api.cur, mysql_api.conn
        command = '''
        CREATE TABLE IF NOT EXISTS ipool ( 
        no BIGINT UNSIGNED AUTO_INCREMENT, 
        ip VARCHAR(50) NOT NULL DEFAULT "0.0.0.0" COMMENT "IP address", 
        port SMALLINT UNSIGNED NOT NULL DEFAULT 0 COMMENT "port", 
        anonymous BIT(2) NOT NULL DEFAULT 3 COMMENT "whether ip is anonymous", 
        type BIT(2) NOT NULL DEFAULT 3 COMMENT "HTTP or HTTPS or both or others", 
        physical_address VARCHAR(100) NOT NULL DEFAULT "unknown" COMMENT "where is the server", 
        response_time MEDIUMINT NOT NULL DEFAULT -1 COMMENT "response_time in milliseconds", 
        last_verify DATETIME NOT NULL DEFAULT '1000-01-01 00:00:00' COMMENT "last verify time", 
        created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, 
        PRIMARY KEY (no));'''
        command = command.replace('\n', '')
        command = command.replace('    ', '')
        try:
            cur.execute("CREATE DATABASE IF NOT EXISTS ipool;")
            cur.execute("USE ipool;")
            cur.execute(command)
            conn.commit()
        except:
            conn.rollback()
            raise Exception('mysql_api.check_table failed')

    def save_ips(self, ips_list: list):
        """Save multiple IPs, submitted as a list of dicts. Missing keys simply get default values here; data types etc. are not validated.

        :Column meanings and default values:
        :no, auto-increment primary key;
        :ip, proxy IP ("0.0.0.0");
        :port, proxy port (0);
        :anonymous, whether the proxy is high-anonymity, 0 = no, 1 = yes, 3 = other (3);
        :type, 0 = HTTP, 1 = HTTPS, 3 = other (3);
        :physical_address, geographic location of the proxy ("unknown");
        :response_time, response time in ms (-1);
        :last_verify, when the proxy was last verified ("1000-01-01 00:00:00");
        :created, when this row was created."""
        cur, conn, log = self.cur, mysql_api.conn, self.log
        command = '''INSERT INTO ipool (ip,port,anonymous,type,physical_address,response_time,last_verify) VALUES '''
        if len(ips_list) == 0:
            message = 'no ip from this website'
            log.info(message)
            return
        for ip_dict in ips_list:
            ip, port, anonymous, type_, physical_address, response_time, last_verify = "0.0.0.0", 0, 3, 3, 'unknown', -1, "1000-01-01 00:00:00"
            if 'ip' in ip_dict.keys():
                ip = ip_dict['ip']
            if 'port' in ip_dict.keys():
                port = ip_dict['port']
            if 'anonymous' in ip_dict.keys():
                anonymous = ip_dict['anonymous']
            if 'type' in ip_dict.keys():
                type_ = ip_dict['type']
            if 'physical_address' in ip_dict.keys():
                physical_address = ip_dict['physical_address']
            if 'response_time' in ip_dict.keys():
                response_time = ip_dict['response_time']
            if 'last_verify' in ip_dict.keys():
                last_verify = ip_dict['last_verify']
            command += '("{}",{},{},{},"{}",{},"{}"),'.format(
                ip, port, anonymous, type_, physical_address, response_time,
                last_verify)
            message = 'trying to save ip:port={}:{}'.format(ip, port)
            log.info(message)
        command = command[:-1] + ';'
        try:
            cur.execute(command)
            conn.commit()
            message = 'saving ips successfully'
            log.info(message)
        except:
            conn.rollback()
            log.info('failing command: ' + command)
            raise Exception('mysql_api.save_ips failed')

    def check_repeat(self):
        """Check the stored data for duplicates, considering only rows whose ip and port both match. When duplicates exist, keep the newest record, i.e. the one with the largest no."""
        cur, conn, log = mysql_api.cur, mysql_api.conn, self.log
        cur.execute('USE ipool;')
        cur.execute(
            'SELECT ip,port FROM ipool GROUP BY ip,port HAVING COUNT(*)>1;')
        repeated_items = cur.fetchall()
        if len(repeated_items) == 0:
            message = 'no repeated ip:port in ipool'
            log.info(message)
            return
        else:
            message = 'found repeated ip:port {} kinds'.format(
                len(repeated_items))
            log.info(message)
            command = '''
            DELETE FROM ipool WHERE no IN (SELECT no FROM
            (SELECT no FROM ipool WHERE 
            (ip,port) IN (SELECT ip,port FROM ipool GROUP BY ip,port HAVING COUNT(*)>1) 
            AND no NOT IN (SELECT MAX(no) FROM ipool GROUP BY ip,port HAVING COUNT(*)>1)) AS a);'''
            command = command.replace('\n', '')
            command = command.replace('    ', '')
            try:
                cur.execute(command)
                conn.commit()
                message = 'repeated ip:port deleted successfully. '
                log.info(message)
            except:
                conn.rollback()
                raise Exception('mysql_api.check_repeat failed')

    @staticmethod
    def get_ips(random: bool = False,
                anonymous: list = [0, 1, 2, 3],
                type_: list = [0, 1, 2, 3],
                response_time: int = 30000,
                last_verify_interval: str = '48:00:00',
                limit: int = 1000) -> tuple:
        """Return multiple IPs matching the given criteria; only ip and port are returned.

        :param random: whether to return the IPs in random order
        :param anonymous: anonymity levels to accept, given as a list
        :param type_: proxy types to accept (HTTP, HTTPS or other), given as a list
        :param response_time: maximum response time, in ms
        :param last_verify_interval: maximum time since the proxy was last verified
        :param limit: maximum number of proxies to return
        :return tuple('ip:port', 'ip:port', ...)"""
        cur = mysql_api.cur
        cur.execute('USE ipool;')
        anonymous, type_ = tuple(anonymous), tuple(type_)
        command = '''
        SELECT ip,port FROM ipool 
        WHERE anonymous IN {} AND type IN {} 
        AND response_time BETWEEN 0 AND {} AND (NOW()-last_verify)<"{}" 
        ORDER BY response_time,last_verify DESC LIMIT {};
        '''.format(anonymous, type_, response_time, last_verify_interval,
                   limit)
        if random:
            command = '''
            SELECT ip,port FROM (SELECT ip,port FROM ipool 
            WHERE anonymous IN {} AND type IN {} 
            AND response_time BETWEEN 0 AND {} AND (NOW()-last_verify)<"{}" 
            ) ORDER BY RAND() LIMIT {};
            '''.format(anonymous, type_, response_time, last_verify_interval,
                       limit)
        command = command.replace('\n', '')
        command = command.replace('    ', '')
        try:
            cur.execute(command)
        except:
            raise Exception('mysql_api.get_ips failed')
        ips = cur.fetchall()
        ips = tuple(
            [ips[i][0] + ':' + str(ips[i][1]) for i in range(len(ips))])
        return ips
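
For completeness, here is roughly how the stored proxies could later be read back out; this is only a usage sketch based on the get_ips signature above, and the parameter values are arbitrary examples:

from MyPythonLib.ipool_mysql_api import mysql_api

# up to 50 high-anonymity (anonymous=1) HTTPS (type_=1) proxies, verified within
# the last 24 hours, with a response time below 5000 ms
proxies = mysql_api.get_ips(anonymous=[1],
                            type_=[1],
                            response_time=5000,
                            last_verify_interval='24:00:00',
                            limit=50)
print(proxies)  # ('ip:port', 'ip:port', ...)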

The contents of the mysql and log self-built libraries are covered in Python, Crawlers and Deep Learning (7) — Building Your Own Library.

(3) Running the spider

In cmd, go into D:\spiders\scrapy\ipool\ipool and run scrapy crawl ip3366 to start the spider.
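
If you would rather launch the spider from a Python script instead of the command line (convenient for scheduling with cron or Task Scheduler), something like the script below should also work when run from inside the project, so that scrapy.cfg and settings.py can be found. It only uses standard Scrapy APIs; the script name is made up:

# run_ip3366.py, placed e.g. in D:\spiders\scrapy\ipool
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl('ip3366')  # same spider name as in "scrapy crawl ip3366"
process.start()          # blocks until the crawl is finished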

Tags: python, scrapy