May 30, 2021 Article blog
When writing crawlers with libraries such as requests or aiohttp, we have to implement everything ourselves from beginning to end, including exception handling and crawl scheduling, which becomes tedious as the code grows. Using an existing crawler framework can improve development efficiency, and among Python crawler frameworks, Scrapy deserves to be called the most popular and powerful.
About Scrapy
Scrapy is an asynchronous processing framework based on Twisted, a crawler framework implemented in pure Python. It has a clear architecture, low coupling between modules, and strong extensibility, so it can flexibly meet a variety of requirements. We only need to customize and develop a few modules to easily implement a crawler.
The architecture of the Scrapy crawler framework is shown in the following image:
It consists of the following parts:
Engine: the core that controls the data flow among all components and triggers events.
Scheduler: accepts requests from the engine and queues them up for later fetching.
Downloader: downloads web pages and feeds the responses back to the engine.
Spiders: user-defined classes that parse responses and extract items or follow-up requests.
Item Pipeline: processes the items extracted by the Spiders, e.g. cleaning, validating, and storing them.
Downloader Middlewares: hooks between the engine and the Downloader that process requests and responses.
Spider Middlewares: hooks between the engine and the Spiders that process Spider input and output.
Scrapy data flow mechanism
The data flow in Scrapy is controlled by the engine, roughly as follows: the engine gets the initial requests from the Spider and schedules them with the Scheduler; it then asks the Scheduler for the next request and sends it to the Downloader through the Downloader Middlewares; the Downloader fetches the page and returns a response, which the engine passes to the Spider through the Spider Middlewares; the Spider parses the response and returns extracted items and new requests; items go to the Item Pipeline for processing, new requests go back to the Scheduler, and the loop repeats until there are no more requests.
With multiple components collaborating, each doing its own job and all supporting asynchronous processing, Scrapy makes full use of network bandwidth and greatly improves the efficiency of data crawling and processing.
pip install Scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
The installation method refers to the official documentation: https://docs.scrapy.org/en/latest/intro/install.html
After the installation is complete, if the scrapy command works properly, the installation was successful.
Scrapy is a framework that pre-configures many available components and the scaffolding used to write crawlers; that is, it pre-builds a project skeleton on which we can quickly write a crawler.
The Scrapy framework creates a project from the command line, and the commands for creating the project are as follows:
scrapy startproject practice
Once the command is executed, a folder called practice appears in the current working directory. This is the Scrapy project skeleton, on which we can write our crawler.
practice/
    __pycache__/
    spiders/
        __pycache__/
        __init__.py
        spider1.py
        spider2.py
        ...
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    scrapy.cfg
The functionality of each file is described below:
scrapy.cfg: the configuration file of the Scrapy project, defining the project's settings module and deployment information.
items.py: defines the Item data structures that hold the crawled data.
middlewares.py: defines the Spider Middlewares and Downloader Middlewares.
pipelines.py: defines the Item Pipelines that process the crawled items.
settings.py: the global configuration of the project.
spiders/: contains the Spiders, one file per crawler.
Target URL: http://quotes.toscrape.com/
Create a project
Create a scrapy project, and the project file can be generated directly with the scrapy command, which looks like this:
scrapy startproject practice
Create a Spider
A Spider is a class we define ourselves that Scrapy uses to crawl content from web pages and parse the crawl results. This class must inherit the scrapy.Spider class provided by Scrapy, and define the Spider's name, its start requests, and how to handle the crawl results.
Use the command line to create a Spider with the following command:
cd practice
scrapy genspider quotes quotes.toscrape.com
Switch the path to the practice folder you just created, then execute the genspider command. The first argument is the name of the Spider and the second is the domain name of the website. Once executed, there is an additional quotes.py in the spiders folder; this is the Spider you just created, as follows:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
You can see that there are three attributes in quotes.py, name, allowed_domains, and start_urls, and one method, parse. name is the unique name that identifies the Spider, allowed_domains lists the domains the Spider is allowed to crawl, start_urls contains the URLs the Spider crawls at startup, and parse is the method called to handle the response of each downloaded start URL.
Create Item
Item is a container that holds the crawled data and is used much like a dictionary. However, Item offers more protection than a dictionary, which helps avoid spelling mistakes and field definition errors.
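This "protection" can be illustrated with a plain-Python analogy. The sketch below is a stdlib stand-in for the idea, not Scrapy's actual implementation: assignments to fields that were never declared fail immediately instead of silently creating a new key.

```python
# Stdlib analogy of scrapy.Item's field protection (not Scrapy's real code):
# only declared fields may be assigned.
class FieldCheckedDict(dict):
    fields = ('text', 'author', 'tags')  # the declared fields

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)

item = FieldCheckedDict()
item['text'] = 'a quote'      # fine: 'text' is a declared field
try:
    item['txet'] = 'a quote'  # typo: raises KeyError instead of passing silently
except KeyError as e:
    print('caught:', e)
```

A real scrapy.Item behaves the same way: assigning to a field that was not declared with scrapy.Field() raises a KeyError.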
Creating an Item requires inheriting the scrapy.Item class and defining fields of type scrapy.Field. Looking at the target site, we can extract text, author, and tags.
To define the Item, modify items.py as follows:
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Three fields are defined, and the name of the class is changed to QuoteItem; it will be used in the crawl that follows.
Parse the Response
The response parameter of the parse method is the result of crawling the links in start_urls. So inside parse we can directly analyze what the response contains, for example by viewing the source code of the requested page, analyzing its content further, or finding links in the result to construct the next request.
In the page you can see both the data to be extracted and the link to the next page; both can be processed.
First look at the structure of the web page, as shown in the figure. Each page has multiple blocks whose class is quote, and each block contains text, author, and tags. So we first find all the quotes, then extract the contents of each quote.
Data can be extracted with CSS selectors or XPath selectors.
Use Item
The Item was defined above and will be used next. An Item can be understood as a dictionary, but it needs to be instantiated when declared. We then assign each field of the Item with the results just parsed, and finally yield the Item.
import scrapy
from practice.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response, **kwargs):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
Follow-up Request
The operation above crawls content from the initial page. To implement a page-by-page crawl, we need to find information on the current page to generate the next request, then find information on that next page to construct yet another request. Repeating this loop allows the whole site to be crawled.
Looking at the page source, you can see that the link to the next page is /page/2/, while the full link is http://quotes.toscrape.com/page/2/; the next request can be constructed from this link. Constructing a request requires scrapy.Request. Here we pass two parameters, url and callback: url is the target of the request, and callback is the function to be called when the response to that request comes back, here the parse() method defined above. Because parse is the method that parses text, author, and tags, and the structure of the next page is the same as the page just parsed, we can reuse the parse method to parse it.
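Note that response.urljoin resolves a relative link like /page/2/ against the current page's URL. Its behavior matches the standard library's urllib.parse.urljoin, so the idea can be sketched on its own:

```python
from urllib.parse import urljoin

# response.urljoin(next_page) behaves like urljoin(response.url, next_page)
base = 'http://quotes.toscrape.com/'
next_page = '/page/2/'

next_url = urljoin(base, next_page)
print(next_url)  # http://quotes.toscrape.com/page/2/
```

This is why the spider can yield the relative href extracted from the pager without hard-coding the domain.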
"""
@Author :叶庭云
@Date :2020/10/2 11:40
@CSDN :https://blog.csdn.net/fyfugoyfa
"""
import scrapy
from practice.items import QuoteItem
class QuotesSpider(scrapy.Spider):
name = 'quotes'
allowed_domains = ['quotes.toscrape.com']
start_urls = ['http://quotes.toscrape.com/']
def parse(self, response, **kwargs):
quotes = response.css('.quote')
for quote in quotes:
item = QuoteItem()
item['text'] = quote.css('.text::text').extract_first()
item['author'] = quote.css('.author::text').extract_first()
item['tags'] = quote.css('.tags .tag::text').extract()
yield item
next_page = response.css('.pager .next a::attr("href")').extract_first()
next_url = response.urljoin(next_page)
yield scrapy.Request(url=next_url, callback=self.parse)
Run the spider. Go to the project directory and run the following command:
scrapy crawl quotes -o quotes.csv
After the command runs, there is an additional quotes.csv file in the project directory that contains everything just crawled.
Many output formats are supported, such as json, xml, pickle, and marshal, as well as remote outputs such as ftp and s3; further outputs can be supported by customizing an ItemExporter.
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
For the ftp output, the user name, password, address, and output path need to be configured correctly, otherwise an error will be reported.
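Besides the -o flag, newer Scrapy versions (2.1+) let you declare feed exports in settings.py through the FEEDS setting; a minimal sketch of the equivalent configuration:

```python
# settings.py (sketch): equivalent of passing -o on the command line
FEEDS = {
    'quotes.json': {'format': 'json', 'encoding': 'utf8'},
    'quotes.csv': {'format': 'csv'},
}
```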
With the Feed Exports provided by Scrapy, we can easily output crawl results to files, which should be sufficient for small projects. However, if you want more complex output, such as writing to a database, you can use an Item Pipeline to do so flexibly.
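As a sketch of such a pipeline, here is a minimal example that writes items into an SQLite database. It is an illustration, not part of the original project: the class name, table name, and database file are all made up; the open_spider/process_item/close_spider methods are the standard hooks Scrapy calls on a pipeline.

```python
import sqlite3

class SQLitePipeline:
    """Sketch of an Item Pipeline that stores quotes in SQLite."""

    def open_spider(self, spider):
        # Called once when the spider opens: connect and create the table.
        self.conn = sqlite3.connect('quotes.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)'
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.conn.execute(
            'INSERT INTO quotes VALUES (?, ?, ?)',
            (item['text'], item['author'], ','.join(item['tags'])),
        )
        self.conn.commit()
        return item  # always return the item so later pipelines can see it

    def close_spider(self, spider):
        # Called once when the spider closes.
        self.conn.close()
```

To enable it, the class would be registered under ITEM_PIPELINES in settings.py, in the same way as the image pipeline later in this article.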
Target URL: http://sc.chinaz.com/tupian/dangaotupian.html
Create a project
scrapy startproject get_img
cd get_img
scrapy genspider img_spider sc.chinaz.com
Construct the request
Define a start_requests() method in img_spider.py. For example, to crawl the cake pictures on this site across 10 pages, we generate 10 requests, as follows:
def start_requests(self):
    for i in range(1, 11):
        if i == 1:
            url = 'http://sc.chinaz.com/tupian/dangaotupian.html'
        else:
            url = f'http://sc.chinaz.com/tupian/dangaotupian_{i}.html'
        yield scrapy.Request(url, self.parse)
Write items.py
import scrapy

class GetImgItem(scrapy.Item):
    img_url = scrapy.Field()
    img_name = scrapy.Field()
Write img_spider.py. The Spider class defines how to crawl a site (or several), including the crawling actions (e.g., whether to follow links) and how structured data is extracted from the content of the web pages (i.e., crawling items).
"""
@Author :叶庭云
@Date :2020/10/2 11:40
@CSDN :https://blog.csdn.net/fyfugoyfa
"""
import scrapy
from get_img.items import GetImgItem
class ImgSpiderSpider(scrapy.Spider):
name = 'img_spider'
def start_requests(self):
for i in range(1, 11):
if i == 1:
url = 'http://sc.chinaz.com/tupian/dangaotupian.html'
else:
url = f'http://sc.chinaz.com/tupian/dangaotupian_{i}.html'
yield scrapy.Request(url, self.parse)
def parse(self, response, **kwargs):
src_list = response.xpath('//div[@id="container"]/div/div/a/img/@src2').extract()
alt_list = response.xpath('//div[@id="container"]/div/div/a/img/@alt').extract()
for alt, src in zip(alt_list, src_list):
item = GetImgItem() # 生成item对象
# 赋值
item['img_url'] = src
item['img_name'] = alt
yield item
Write a pipeline file pipelines.py
Scrapy provides Pipelines that specialize in downloading, covering both file and image downloads. Downloading files and images follows the same principles as fetching pages, so the download process supports asynchronous and multithreaded operation and is efficient.
from scrapy.pipelines.images import ImagesPipeline  # Scrapy's image downloader
from scrapy import Request
from scrapy.exceptions import DropItem

class GetImgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # request the download of each image
        yield Request(item['img_url'], meta={'name': item['img_name']})

    def item_completed(self, results, item, info):
        # inspect the download results and drop items whose images failed
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    def file_path(self, request, response=None, info=None):
        # override file_path to save each image under its original name
        name = request.meta['name']  # the image name passed via meta above
        file_name = name + '.jpg'    # add the image extension
        return file_name
GetImgPipeline is implemented here by inheriting Scrapy's built-in ImagesPipeline and overriding the following methods:
get_media_requests(): takes each item and yields a Request for its image URL, which Scrapy schedules for download.
file_path(): returns the storage path for each downloaded image; here, the original name carried in the request's meta plus a .jpg suffix.
item_completed(): called when all downloads for a single item finish; it inspects the results, and if the image failed to download, the item is dropped.
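The results argument that item_completed receives is a list of (success, info) two-tuples, where for a successful download info is a dict with keys such as 'url', 'path', and 'checksum'. The filtering line in item_completed can be sketched in isolation (the sample data below is made up):

```python
# Made-up sample of the `results` list passed to item_completed:
# one successful download and one failure.
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'a.jpg', 'checksum': 'abc'}),
    (False, Exception('download error')),
]

# Same list comprehension as in GetImgPipeline.item_completed:
# keep only the storage paths of the successful downloads.
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['a.jpg']
```

If every entry in results failed, image_paths would be empty and the item would be dropped.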
Configure settings.py
# settings.py
BOT_NAME = 'get_img'
SPIDER_MODULES = ['get_img.spiders']
NEWSPIDER_MODULE = 'get_img.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.25
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'get_img.pipelines.GetImgPipeline': 300,
}
IMAGES_STORE = './images'  # path where images are saved; the folder is created automatically
Run the program:
# switch to the directory of img_spider
scrapy crawl img_spider
The Scrapy crawler crawls and downloads very quickly.
Check the local images folder and see that the pictures have been downloaded successfully, as shown in the figure:
Up to now, we have learned the basic architecture of Scrapy, actually created a Scrapy project, written code for example crawls, and become familiar with the basic use of the Scrapy crawler framework. From here you can dig deeper into Scrapy's usage and feel its power.
Author: Ye Tingyun
Original link: https://yetingyun.blog.csdn.net/article/details/108217479