Python, Web Scraping and Deep Learning (9): A First Look at Scrapy
Steps / Contents:
1. Use cases and the Scrapy architecture
2. Installing Scrapy and creating a project
3. Using Scrapy to crawl http://www.ip3366.net/free/
(1) Creating the project and modifying the framework files
(2) Contents of the custom library
(3) Running the spider
This article was first published on my personal blog at https://lisper517.top/index.php/archives/47/ ; please credit the source when reposting.
This article explains how to use Scrapy.
It was written on September 10, 2022.
As the saying goes, the better you write crawlers, the sooner you end up in jail. When running a crawler, do not consume too much of the target server's bandwidth, or you may get yourself into trouble.
My abilities are limited and I am not a CS professional, so there are bound to be oversights or places where the code does not follow good practice; please point them out in the comments.
1. Use cases and the Scrapy architecture
In the earlier article Python, Web Scraping and Deep Learning (6): Crawlers in Detail (2), Building Your Own IP Pool (Simple Version), I mentioned https://free.kuaidaili.com/free/ , a site that lists more than 4,700 pages of free proxy IPs. Crawling all of those IPs takes a while, but it is fairly simple, because the URLs of the pages to crawl differ only in the page number at the end.
Now suppose I want to crawl an entire website: start from its root page, scrape the data I need from it, extract every qualifying hyperlink on it (only links that belong to this site; for example, pages with an ICP filing notice at the bottom carry a link to the Ministry of Industry and Information Technology's filing site, and links like that should be skipped), visit each of those links in turn, and repeat the two steps of scraping data and extracting links on every page. Kept up long enough, this visits every link that ever appears on the site's pages (links that never appear on any page are hidden and out of reach), which is what crawling a whole site usually means.
Be aware, though, that some sites deliberately plant links that a normal visitor can never see (links that do not show up in a browser, for example placed outside the visible area, or empty a tags). A crawler reading the page source finds them easily, but as soon as it visits such a page the server knows it is a bot and stops serving it; this is another anti-crawling technique. On our side, the best defence is to keep the link-extraction rules narrow and avoid grabbing every link on a page with a broad regular expression, as the sketch below illustrates.
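As an illustration of narrowing the rules, Scrapy's CrawlSpider and LinkExtractor let you declare exactly which links may be followed. The sketch below is only a hypothetical example of the idea and is not part of this article's project; the spider name, domain and allow pattern are placeholders.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WholeSiteSpider(CrawlSpider):
    name = 'whole_site_demo'               # hypothetical spider name
    allowed_domains = ['example.com']      # never leave this domain
    start_urls = ['https://example.com/']
    rules = (
        # follow only article-like links instead of every <a> tag on the page
        Rule(LinkExtractor(allow=r'/articles/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}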
Scrapy is a framework designed for exactly this kind of site-wide crawling. See the runoob tutorial for an introduction to its components; here is a brief summary. Crawling a site with Scrapy works like this: starting from an initial URL, the Downloader fetches the response for that URL and hands it to the Spider, which parses it and extracts both the data and any further URLs to visit. The data goes to the Item Pipeline for storage, for example into a txt file or a database; the other URLs are turned into requests in the Spider and sent to the Scheduler, which hands them back to the Downloader at a suitable time. There are also downloader middlewares and spider middlewares for custom behaviour; for instance, if you want to fetch pages with selenium, you would override the response-fetching step in a middleware. At the centre of these four components sits the Scrapy engine, the core component that coordinates them and passes messages between them.
Finally, Scrapy fetches pages asynchronously and with high concurrency (it is built on Twisted rather than on multiple processes), so it crawls very fast, but that also makes it easier to get your IP banned. Before using Scrapy it is best to already have some proxy IPs on hand, and to throttle the crawl with settings like those shown below.
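The settings below are a conservative example of such throttling; they all appear, commented out, in the generated settings.py shown later, and the exact values here are only an illustration:
CONCURRENT_REQUESTS = 8      # fewer parallel requests than the default 16
DOWNLOAD_DELAY = 1           # wait about 1 second between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to the server's latency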
2. Installing Scrapy and creating a project
Install Scrapy with pip install Scrapy
. Then, in cmd or a terminal, go to the folder where you keep your Scrapy projects (you can dedicate this folder to crawlers from now on) and create a project:
scrapy startproject project_name
Then enter the project folder and create a spider:
scrapy genspider spider_name "url_to_crawl"
The URL can be anything for now; it can be changed in the generated file. Also, a single project can contain multiple spiders.
After modifying the files in the project, run the spider with scrapy crawl spider_name
. The rest of this article shows how to modify these files.
3. Using Scrapy to crawl http://www.ip3366.net/free/
Using http://www.ip3366.net/free/ as an example, this section demonstrates how to use Scrapy to collect the proxy IPs on that site. The commands below are run on a Windows 10 machine.
(1) Creating the project and modifying the framework files
Create a new folder, for example D:\spiders\scrapy
, then run the following:
scrapy startproject ipool
cd ipool
scrapy genspider ip3366 http://www.ip3366.net/free/
During creation you will notice that Scrapy uses project templates and spider templates; here we use the default ones. If you later have specific requirements for your spiders, you can write your own templates.
Then open D:\spiders\scrapy\ipool\ipool in File Explorer. You will see items.py, middlewares.py, pipelines.py and settings.py, plus ip3366.py inside the spiders subfolder; these are the five files we will modify next.
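For reference, the layout generated by the two commands above looks like this (comments added):
D:\spiders\scrapy\ipool
    scrapy.cfg            # project configuration used by the scrapy command
    ipool
        __init__.py
        items.py          # defines the fields to be scraped
        middlewares.py    # spider middleware and downloader middleware
        pipelines.py      # stores / processes the scraped items
        settings.py       # project-wide settings
        spiders
            __init__.py
            ip3366.py     # the spider created by scrapy genspider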
The original content of items.py is:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class IpoolItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
Modifying this file is simple: it defines the fields to be scraped. Change items.py to:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class IpoolItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
ips_dict_list = scrapy.Field()
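A scrapy.Field() declaration simply registers a key; the item itself is then used like a dictionary. A quick sketch of how the spider will fill it later (the values here are made up):
from ipool.items import IpoolItem

item = IpoolItem()
item['ips_dict_list'] = [{'ip': '1.2.3.4', 'port': 8080}]  # made-up example data
print(dict(item))  # a scrapy.Item can be converted to a plain dict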
Next comes middlewares.py, the file that holds the two middleware classes, where custom behaviour can be configured; in this example we only add a User-Agent. The original file is:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class IpoolSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class IpoolDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
Set the User-Agent in the downloader middleware; I have also added a few comments:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
# Spider middleware
class IpoolSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
import random
# Downloader middleware
class IpoolDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
UA_list = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0'
]
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
    # Intercepts every normal (non-exception) request; a good place for User-Agent spoofing
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
request.headers['User-Agent'] = random.choice(self.UA_list)
return None
    # Intercepts every response; useful when scraping dynamically loaded data
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
    # Intercepts requests that raised an exception; a good place to set a proxy IP
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
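As the comment above process_exception says, that hook is also the natural place to attach a proxy to requests that failed. Proxies are not used in this project's middleware, but a minimal sketch of such a variant (with a made-up proxy list, relying on the random import already added above) would look like this:
    # Hypothetical variant of process_exception: retry failed requests through a proxy
    def process_exception(self, request, exception, spider):
        proxy_list = ['http://1.2.3.4:8080']               # made-up proxies; use your own pool
        request.meta['proxy'] = random.choice(proxy_list)  # picked up by Scrapy's HttpProxyMiddleware
        return request  # returning the request sends it back to be downloaded again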
Next is pipelines.py, where the scraped data is processed. A project can define several pipeline classes; their priorities are set later in settings.py. Also make sure every pipeline's process_item ends with return item, so that once a pipeline has finished processing the data passed in from the spider (the item), the item is handed on to the pipeline with the next priority. The original file is:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
class IpoolPipeline:
def process_item(self, item, spider):
return item
Rewrite it as:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from MyPythonLib.ipool_mysql_api import mysql_api
import os
class IpoolPipeline:
    # This hook is called only once, when the spider starts
def open_spider(self, spider):
self.fp = open(os.path.join(os.getcwd(), 'ips.txt'),
'w',
encoding='utf-8')
    # This method processes the scraped data
def process_item(self, item, spider):
for ip_dict in item['ips_dict_list']:
ip = ip_dict['ip']
port = ip_dict['port']
self.fp.write(ip + ':' + str(port) + '\n')
return item
    # This hook is called only once, when the spider closes
def close_spider(self, spider):
self.fp.close()
class MysqlPipeline:
def open_spider(self, spider):
self.mysql_api = mysql_api()
def process_item(self, item, spider):
self.mysql_api.save_ips(item['ips_dict_list'])
return item
def close_spider(self, spider):
self.mysql_api.check_repeat()
Two pipeline classes are defined here: one saves the crawled ip and port pairs to a txt file, the other writes the data into MySQL. MyPythonLib is my own library; its contents are shown in the next section.
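As a side note, for simple cases you do not need a custom pipeline at all: Scrapy's built-in feed export can dump every yielded item straight to a file from the command line, for example:
scrapy crawl ip3366 -o ips.json
Here a MySQL pipeline is used instead, because the collected IPs need to be deduplicated and queried again later.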
Next is settings.py; its original content is:
# Scrapy settings for ipool project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ipool'
SPIDER_MODULES = ['ipool.spiders']
NEWSPIDER_MODULE = 'ipool.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ipool (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ipool.middlewares.IpoolSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'ipool.middlewares.IpoolDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'ipool.pipelines.IpoolPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Add and change a few lines in it:
# Scrapy settings for ipool project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
LOG_LEVEL = 'ERROR'  # only print error messages
BOT_NAME = 'ipool'
SPIDER_MODULES = ['ipool.spiders']
NEWSPIDER_MODULE = 'ipool.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ipool (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # do not obey robots.txt
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ipool.middlewares.IpoolSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'ipool.middlewares.IpoolDownloaderMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ipool.pipelines.IpoolPipeline': 2,  # items flow through the pipelines in ascending order of these numbers
'ipool.pipelines.MysqlPipeline': 1
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Finally comes ip3366.py, the file that parses the crawled content. The original file is:
import scrapy
class Ip3366Spider(scrapy.Spider):
name = 'ip3366'
allowed_domains = ['www.ip3366.net']
start_urls = ['http://www.ip3366.net/']
def parse(self, response):
pass
Modify it to:
import scrapy
from ipool.items import IpoolItem
class Ip3366Spider(scrapy.Spider):
name = 'ip3366'
    #allowed_domains = ['www.ip3366.net']  # allowed domains; URLs outside them will not be crawled
start_urls = [
'http://www.ip3366.net/free/', 'http://www.ip3366.net/free/?stype=2'
    ]  # list of start URLs
base_url = 'http://www.ip3366.net/free/'
def parse(self, response):
ip_nodes = response.xpath(r'//*[@id="list"]/table/tbody/tr')
        ips_dict_list = []  # a list of dicts, matching the format expected by mysql_api.save_ips
for ip_node in ip_nodes:
ip_dict = {}
ip = ip_node.xpath(r'./td[1]/text()').extract()[0].strip()
port = int(ip_node.xpath(r'./td[2]/text()').extract()[0].strip())
anonymous = (
0, 1)['高匿' in ip_node.xpath(r'./td[3]/text()').extract()[0]]
            # (a, b)[condition] is another way to write a ternary expression in Python; note that when the condition is true, the whole expression evaluates to b
type_ = (
0, 1)['HTTPS' in ip_node.xpath(r'./td[4]/text()').extract()[0]]
physical_address = ip_node.xpath(
r'./td[5]/text()').extract()[0].strip()
response_time = 1000 * int(
ip_node.xpath(r'./td[6]/text()').extract()[0].strip().replace(
'秒', ''))
response_time = (response_time, 250)[response_time == 0]
            # many proxies on this site report a latency of 0, which is not accurate, so set them all to 250 ms, i.e. 0.25 s
last_verify = ip_node.xpath(
r'./td[7]/text()').extract()[0].strip().replace('/', '-')
ip_dict['ip'] = ip
ip_dict['port'] = port
ip_dict['anonymous'] = anonymous
ip_dict['type'] = type_
ip_dict['physical_address'] = physical_address
ip_dict['response_time'] = response_time
ip_dict['last_verify'] = last_verify
ips_dict_list.append(ip_dict)
item = IpoolItem()
item['ips_dict_list'] = ips_dict_list
yield item
next_url = response.xpath(
r'//*[@id="listnav"]/ul/a[text()="下一页"]/@href').extract()
if next_url != None and len(next_url) != 0:
next_url = Ip3366Spider.base_url + next_url[0]
yield scrapy.Request(url=next_url,
                                 callback=self.parse)  # the callback is this method itself, i.e. pages are parsed recursively
    # This method is called once when the spider closes
    def closed(self, reason):
        pass
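When writing XPath expressions like the ones above, it is convenient to test them in the Scrapy shell first and only paste them into parse() once they work. For example:
scrapy shell "http://www.ip3366.net/free/"
>>> response.xpath(r'//*[@id="list"]/table/tbody/tr/td[1]/text()').extract()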
(2) Contents of the custom library
See Python, Web Scraping and Deep Learning (7): Building Your Own Library for how to create your own library. The ipool_mysql_api.py file I use contains:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import os, logging, time
from MyPythonLib import mysql
from MyPythonLib import log as mylog
if not os.path.isdir(os.path.join(os.getcwd(), 'logs')):
os.mkdir(os.path.join(os.getcwd(), 'logs'))
class mysql_api():
"""通过这个接口实现提交爬取的ip、取出ip的操作。"""
conn = mysql.get_conn()
    cur = conn.cursor()  # this cursor is shared by the whole class, for work that does not need an instance
def __init__(self, log: logging.Logger = None):
if log == None:
log = mylog.get_logger(
os.path.join(os.getcwd(), 'logs'),
'mysql_api_log#{}.txt'.format(
time.strftime('-%Y-%m-%d_%H_%M_%S')), 'mysql_api')
self.log = log
message = 'one of mysql_api object initialized'
self.log.info(message)
self.cur = mysql_api.conn.cursor()
mysql_api.check_table()
message = 'database ipool and table ipool checked'
self.log.info(message)
def __del__(self):
self.cur.close()
message = 'one of mysql_api object deleted'
self.log.info(message)
@staticmethod
def check_table():
"""检查一下mysql中是否已经建好了表,若表不存在则新建表。
这里并没有检查:若表存在,表的格式是否正确。
:数据库的各列含义与默认值:
:no,自增主键;
:ip,代理ip("0.0.0.0");
:port,代理ip端口(0);
:anonymous,是否为高匿,0为否,1为是,3为其他(3);
:type,0为HTTP,1为HTTPS,3为其他(3);
:physical_address,代理ip的地理位置("unknown");
:response_time,响应时间,单位为ms(-1);
:last_verify,上次验证该代理ip的时间("1000-01-01 00:00:00");
:created,该项目创建时间。"""
cur, conn = mysql_api.cur, mysql_api.conn
command = '''
CREATE TABLE IF NOT EXISTS ipool (
no BIGINT UNSIGNED AUTO_INCREMENT,
ip VARCHAR(50) NOT NULL DEFAULT "0.0.0.0" COMMENT "IP address",
port SMALLINT UNSIGNED NOT NULL DEFAULT 0 COMMENT "port",
anonymous BIT(2) NOT NULL DEFAULT 3 COMMENT "whether ip is anonymous",
type BIT(2) NOT NULL DEFAULT 3 COMMENT "HTTP or HTTPS or both or others",
physical_address VARCHAR(100) NOT NULL DEFAULT "unknown" COMMENT "where is the server",
            response_time MEDIUMINT NOT NULL DEFAULT -1 COMMENT "response_time in milliseconds",
last_verify DATETIME NOT NULL DEFAULT '1000-01-01 00:00:00' COMMENT "last verify time",
created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (no));'''
        command = ' '.join(command.split())  # collapse newlines and indentation into single spaces
try:
cur.execute("CREATE DATABASE IF NOT EXISTS ipool;")
cur.execute("USE ipool;")
cur.execute(command)
conn.commit()
except:
conn.rollback()
raise Exception('mysql_api.check_table failed')
def save_ips(self, ips_list: list):
"""存入多个ip(以字典的列表形式提交)。注意这里只是对不存在的数据添加默认值,没有验证数据类型等。
:数据库的各列含义与默认值:
:no,自增主键;
:ip,代理ip("0.0.0.0");
:port,代理ip端口(0);
:anonymous,是否为高匿,0为否,1为是,3为其他(3);
:type,0为HTTP,1为HTTPS,3为其他(3);
:physical_address,代理ip的地理位置("unknown");
:response_time,响应时间,单位为ms(-1);
:last_verify,上次验证该代理ip的时间("1000-01-01 00:00:00");
:created,该项目创建时间。"""
cur, conn, log = self.cur, mysql_api.conn, self.log
command = '''INSERT INTO ipool (ip,port,anonymous,type,physical_address,response_time,last_verify) VALUES '''
if len(ips_list) == 0:
message = 'no ip from this website'
log.info(message)
return
for ip_dict in ips_list:
ip, port, anonymous, type_, physical_address, response_time, last_verify = "0.0.0.0", 0, 3, 3, 'unknown', -1, "1000-01-01 00:00:00"
if 'ip' in ip_dict.keys():
ip = ip_dict['ip']
if 'port' in ip_dict.keys():
port = ip_dict['port']
if 'anonymous' in ip_dict.keys():
anonymous = ip_dict['anonymous']
if 'type' in ip_dict.keys():
type_ = ip_dict['type']
if 'physical_address' in ip_dict.keys():
physical_address = ip_dict['physical_address']
if 'response_time' in ip_dict.keys():
response_time = ip_dict['response_time']
if 'last_verify' in ip_dict.keys():
last_verify = ip_dict['last_verify']
command += '("{}",{},{},{},"{}",{},"{}"),'.format(
ip, port, anonymous, type_, physical_address, response_time,
last_verify)
message = 'trying to save ip:port={}:{}'.format(ip, port)
log.info(message)
command = command[:-1] + ';'
try:
cur.execute(command)
conn.commit()
message = 'saving ips successfully'
log.info(message)
except:
conn.rollback()
            log.info('failed SQL command: ' + command)
raise Exception('mysql_api.save_ips failed')
def check_repeat(self):
"""检查爬到的数据里有没有重复的,只检查ip和port都相同的。如果有重复的,保留最新的一条记录,也就是no最大的。"""
cur, conn, log = mysql_api.cur, mysql_api.conn, self.log
cur.execute('USE ipool;')
cur.execute(
'SELECT ip,port FROM ipool GROUP BY ip,port HAVING COUNT(*)>1;')
repeated_items = cur.fetchall()
if len(repeated_items) == 0:
message = 'no repeated ip:port in ipool'
log.info(message)
return
else:
message = 'found repeated ip:port {} kinds'.format(
len(repeated_items))
log.info(message)
command = '''
DELETE FROM ipool WHERE no IN (SELECT no FROM
(SELECT no FROM ipool WHERE
(ip,port) IN (SELECT ip,port FROM ipool GROUP BY ip,port HAVING COUNT(*)>1)
AND no NOT IN (SELECT MAX(no) FROM ipool GROUP BY ip,port HAVING COUNT(*)>1)) AS a);'''
        command = ' '.join(command.split())  # collapse newlines and indentation into single spaces
try:
cur.execute(command)
conn.commit()
message = 'repeated ip:port deleted successfully. '
log.info(message)
except:
conn.rollback()
raise Exception('mysql_api.check_repeat failed')
@staticmethod
def get_ips(random: bool = False,
anonymous: list = [0, 1, 2, 3],
type_: list = [0, 1, 2, 3],
response_time: int = 30000,
last_verify_interval: str = '48:00:00',
limit: int = 1000) -> tuple:
"""根据要求返回多个ip,只返回ip和port。
:param random: 是否随机返回ip
:param anonymous: 以列表形式指定匿名性
:param type_: 对HTTP还是HTTPS或其他代理
:param response_time: 响应时间在多少ms之内
:param last_verify_interval: 用于指定上次验证的时间距现在不超过多久
:param limit: 最多返回几条代理ip
:return tuple('ip:port','ip:port', ...)"""
cur = mysql_api.cur
cur.execute('USE ipool;')
anonymous, type_ = tuple(anonymous), tuple(type_)
command = '''
SELECT ip,port FROM ipool
WHERE anonymous IN {} AND type IN {}
AND response_time BETWEEN 0 AND {} AND (NOW()-last_verify)<"{}"
ORDER BY response_time,last_verify DESC LIMIT {};
'''.format(anonymous, type_, response_time, last_verify_interval,
limit)
if random:
command = '''
SELECT ip,port FROM (SELECT ip,port FROM ipool
WHERE anonymous IN {} AND type IN {}
AND response_time BETWEEN 0 AND {} AND (NOW()-last_verify)<"{}"
            ) AS t ORDER BY RAND() LIMIT {};
'''.format(anonymous, type_, response_time, last_verify_interval,
limit)
        command = ' '.join(command.split())  # collapse newlines and indentation into single spaces
try:
cur.execute(command)
except:
raise Exception('mysql_api.get_ips failed')
ips = cur.fetchall()
ips = tuple(
[ips[i][0] + ':' + str(ips[i][1]) for i in range(len(ips))])
return ips
For the contents of the two custom libraries mysql and log, see Python, Web Scraping and Deep Learning (7): Building Your Own Library.
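Once the crawl has filled the ipool table, proxies can be pulled back out through the same interface. A brief usage sketch (the filter values below are only an example):
from MyPythonLib.ipool_mysql_api import mysql_api

# HTTPS (1) or other (3) proxies verified within the last 24 hours, fastest first, at most 50
proxies = mysql_api.get_ips(type_=[1, 3], response_time=5000,
                            last_verify_interval='24:00:00', limit=50)
print(proxies)  # ('ip:port', 'ip:port', ...)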
(3) Running the spider
Open cmd, change into D:\spiders\scrapy\ipool\ipool and run scrapy crawl ip3366
to start the spider.
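The spider can also be started from a short Python script instead of the command line, which is handy if you later want to schedule it with Task Scheduler or cron. A sketch, to be run from the project root D:\spiders\scrapy\ipool :
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from ipool.spiders.ip3366 import Ip3366Spider

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl(Ip3366Spider)
process.start()  # blocks until the crawl is finished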