IT基地 - Python Study Notes - Scrapy Framework Summary (Supplement)

李蓝猫

2021-07-11


Python

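Common commands

Create a crawler project: scrapy startproject <project_name>

Create a spider file: scrapy genspider <name> <domain>

Run a spider: scrapy crawl <spider_name>

Start a crawl with persistence: scrapy crawl <spider_name> -s JOBDIR=job/001 (the job/001 directory is created automatically; press Ctrl+C once to stop gracefully, then run the same command again to resume the crawl)

Run Scrapy from a custom script so that IDE breakpoints work: python <name>.py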

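import os
import sys

from scrapy.cmdline import execute

# Add the directory containing this file to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Same as running `scrapy crawl spider_name` on the command line
execute(["scrapy", "crawl", "spider_name"])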
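The persistent-start command and the debug script can be combined. The following is a minimal sketch of one way to do that, not something taken from the original notes: `build_crawl_argv` and `run` are hypothetical helper names, and `job/001` is simply the example directory used earlier.

```python
import os
import sys


def build_crawl_argv(spider_name, job_dir=None):
    """Build the argument list for `scrapy crawl`, optionally with JOBDIR."""
    argv = ["scrapy", "crawl", spider_name]
    if job_dir:
        # -s NAME=VALUE overrides a setting for this run only
        argv += ["-s", f"JOBDIR={job_dir}"]
    return argv


def run(spider_name, job_dir=None):
    """Start the crawl in-process so that IDE breakpoints are hit."""
    # Imported here so build_crawl_argv can be used without Scrapy installed
    from scrapy.cmdline import execute

    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(build_crawl_argv(spider_name, job_dir))
```

Calling `run("spider_name", job_dir="job/001")` at the bottom of the script would correspond to the persistent-start command, while `run("spider_name")` matches the plain `scrapy crawl` form.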

Debug a page interactively and try out XPath expressions: scrapy shell <URL>
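A typical session inside the shell might look like the sketch below. The lines are meant to be typed at the shell prompt rather than run as a standalone script; `response`, `view`, and `fetch` are provided by the Scrapy shell, while the XPath expressions and the `example.com` URL are placeholders chosen for illustration.

```python
# Typed at the prompt after `scrapy shell <URL>`
response.status                           # HTTP status of the fetched page
response.xpath("//title/text()").get()    # first match as a string, or None
response.xpath("//a/@href").getall()      # every match as a list of strings
view(response)                            # open the downloaded page in a browser
fetch("https://example.com/")             # load another page into `response`
```

Once an expression returns what is needed, it can be copied into the spider's parse method unchanged.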

settings.py configuration

Set ROBOTSTXT_OBEY to False in settings.py; otherwise Scrapy obeys each site's robots.txt rules when crawling.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
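To see what obeying robots.txt means in practice, the sketch below checks two URLs against a small made-up rule set using Python's standard `urllib.robotparser`. It is only an illustration: Scrapy applies these rules through its own robots.txt middleware and parser, and the rules, the `mybot` user agent, and the `example.com` URLs are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("mybot", "https://example.com/private/page.html")
allowed = parser.can_fetch("mybot", "https://example.com/public/page.html")
print(blocked, allowed)  # False True
```

With ROBOTSTXT_OBEY = True, requests of the first kind are filtered out before download; with False they are sent regardless of the site's rules.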

Settings for image downloads

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
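# Time the downloader waits before fetching the next page from the same website.
# Use it to throttle the crawl and reduce server load. Decimals are allowed, e.g. 0.25 (in seconds).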
DOWNLOAD_DELAY = 0

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Add ImagesPipeline and its related settings
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 5,  # the number is the priority: pipelines run in ascending order
}
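The ordering rule can be checked with a few lines of plain Python. This sketch only mimics the sort by priority value; it is not Scrapy's internal code, and `pipeline_order` is a hypothetical helper name.

```python
ITEM_PIPELINES = {
    "ArticleSpider.pipelines.ArticlespiderPipeline": 300,
    "scrapy.pipelines.images.ImagesPipeline": 5,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in pipeline_order(ITEM_PIPELINES):
    print(path)
```

With the values above, ImagesPipeline (5) handles each item before ArticlespiderPipeline (300), so the later pipeline can rely on the images having already been processed.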


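import os  # place with the other imports at the top of settings.py

# Item field that holds the image URLs
IMAGES_URLS_FIELD = "front_image_url"  # front_image_url is the field defined in items.py for the scraped image URLs
# Local directory where downloaded images are saved
project_dir = os.path.abspath(os.path.dirname(__file__))  # absolute path of the directory containing settings.py
IMAGES_STORE = os.path.join(project_dir, 'images')  # images go under <project_dir>/images
# Images expire after 90 days; images downloaded more recently are not fetched again
IMAGES_EXPIRES = 90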
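One detail that is easy to miss: ImagesPipeline iterates over the field named by IMAGES_URLS_FIELD, so the field should hold a list of URLs even when there is only one image (the pipeline also needs the Pillow library installed). The sketch below shows one way to guarantee that using a plain dict item; `build_article_item`, the `title` field, and the `example.com` URL are hypothetical.

```python
def build_article_item(title, front_image_url):
    """Build a dict item whose image field is always a list of URLs."""
    if isinstance(front_image_url, str):
        # A bare string would be iterated character by character downstream
        front_image_url = [front_image_url]
    return {"title": title, "front_image_url": list(front_image_url)}


item = build_article_item("Example post", "https://example.com/cover.jpg")
print(item["front_image_url"])  # ['https://example.com/cover.jpg']
```

A spider callback could yield such a dict directly, or apply the same wrapping before filling the corresponding field of an Item class defined in items.py.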

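Speeding up the crawl

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests (these are asynchronous requests, not threads). Default: 16
CONCURRENT_REQUESTS = 100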

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
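# Time the downloader waits before fetching the next page from the same website.
# Use it to throttle the crawl and reduce server load. Decimals are allowed, e.g. 0.25 (in seconds).
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
# Maximum concurrent requests to a single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 100
# Maximum concurrent requests to a single IP (when non-zero, it is used instead of the per-domain limit)
CONCURRENT_REQUESTS_PER_IP = 100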

# Disable cookies (enabled by default)
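# Disable cookies unless you really need them. Broad crawls do not need cookies
# (search engine crawlers ignore them). Disabling cookies reduces CPU usage and the crawler's memory footprint, which improves performance.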
COOKIES_ENABLED = False
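The "will honor only one of" comment refers to how the two per-site limits interact: when CONCURRENT_REQUESTS_PER_IP is non-zero, Scrapy applies limits per IP and ignores CONCURRENT_REQUESTS_PER_DOMAIN. The sketch below restates that rule in plain Python. It is not Scrapy's own code, `concurrency_scope` is a hypothetical helper, and the fallback values 0 and 8 are Scrapy's documented defaults for the two settings.

```python
def concurrency_scope(settings):
    """Return (scope, limit) for per-site concurrency under the given settings."""
    per_ip = settings.get("CONCURRENT_REQUESTS_PER_IP", 0)
    if per_ip:
        # A non-zero per-IP limit takes over; the per-domain value is ignored
        return ("ip", per_ip)
    return ("domain", settings.get("CONCURRENT_REQUESTS_PER_DOMAIN", 8))


both = concurrency_scope(
    {"CONCURRENT_REQUESTS_PER_DOMAIN": 100, "CONCURRENT_REQUESTS_PER_IP": 100}
)
domain_only = concurrency_scope({"CONCURRENT_REQUESTS_PER_DOMAIN": 100})
print(both, domain_only)  # ('ip', 100) ('domain', 100)
```

With the settings shown in this section, the per-IP limit is therefore the one in effect, and the global CONCURRENT_REQUESTS value still caps the total across all sites.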
