Scrapy framework
1, Introduction
1. Introduction
Scrapy is a crawler framework written in pure Python that does asynchronous processing on top of Twisted. The Scrapy framework is widely used for data acquisition, network monitoring, automated testing and so on.
2. Environment configuration
- Install pywin32
- pip install pywin32
- Install wheel
- pip install wheel
- Install twisted
- pip install twisted
- Install the Scrapy framework
- pip install scrapy
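After installing, you can quickly verify the environment from the same command line; the two commands below only print information and assume pip and scrapy are on the current PATH:

```
pip show scrapy
scrapy version
```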
3. Common commands
command | format | description |
---|---|---|
startproject | scrapy startproject <project name> | Create a new project |
genspider | scrapy genspider <spider file name> <domain name> | Create a new spider file |
runspider | scrapy runspider <spider file> | Run a spider file without creating a project |
crawl | scrapy crawl <spider name> | Run a spider in a project; the project must be created first |
list | scrapy list | List all spider files in the project |
view | scrapy view <url> | Open the url in the browser |
shell | scrapy shell <url> | Interactive command-line mode |
settings | scrapy settings | View the configuration of the current project |
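For example, the shell command is a convenient way to try out selectors interactively before writing them into a spider. A minimal sketch (the URL is simply the one used later in this tutorial; any page works):

```
scrapy shell "https://movie.douban.com/top250"
# inside the interactive shell a `response` object is available, e.g.:
#   response.status
#   response.xpath('//title/text()').get()
```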
4. Operating principle
4.1 flow chart

4.2 component introduction
- Engine: The engine controls the data flow between all components of the system and triggers events when certain actions occur.
- Scheduler: It accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. It can be imagined as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs at the same time.
- Downloader: It downloads web content and returns it to the Engine. The downloader is built on Twisted's efficient asynchronous model.
- Spiders: Classes customized by the developer. They process all responses, analyze and extract data from them, fill the Item fields, and submit follow-up URLs to the engine so they enter the scheduler again.
- Item Pipeline: Responsible for processing items after they are extracted, mainly cleaning, validation and persistence (such as saving to a database).
- Downloader Middleware: Can be thought of as a component for customizing and extending the download process.
- Spider Middleware: Located between the Engine and the Spiders; it mainly processes the input (responses) and output (requests and items) of the Spiders.
4.3 operation process
- Engine: Hi! Spider, which website do you want to handle?
- Spider: The boss wants me to handle xxxx.com.
- Engine: Give me the first URL that needs to be processed.
- Spider: Here you are. The first URL is xxxxxxxx.com.
- Engine: Hi! Scheduler, I have a request here. Please sort it and put it in the queue.
- Scheduler: OK, processing it. Wait a moment.
- Engine: Hi! Scheduler, give me the request you have processed.
- Scheduler: Here you are. This is the request I have processed.
- Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.
- Downloader: OK! Here you are, the downloaded content. (If it fails: sorry, this request failed to download. The engine then tells the scheduler that the request failed, please record it, and we will download it again later.)
- Engine: Hi! Spider, this has been downloaded and already handled according to the boss's downloader middleware. Process it yourself. (Note: the responses here are handled by the def parse() function by default.)
- Spider: (after processing the data, for the URLs that need to be followed up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I obtained.
- Engine: Hi! Pipeline, I have an item here, please deal with it! Scheduler! This is a URL that needs to be followed up, please deal with it. Then the cycle starts again from step 4 until all the information the boss needs has been obtained.
- Pipeline and Scheduler: OK, doing it now!
Note: the whole program stops only when the scheduler has no more requests to process. (For URLs that fail to download, Scrapy will also retry them.)
2, Create project
This example crawls Douban.
1. Modify configuration
```python
LOG_LEVEL = "WARNING"  # Set the log level

from fake_useragent import UserAgent
USER_AGENT = UserAgent().random  # Set the request header

ROBOTSTXT_OBEY = False  # Whether to obey the robots protocol. The default value is True

ITEM_PIPELINES = {  # Enable the pipelines
    'myFirstSpider.pipelines.MyfirstspiderPipeline': 300,  # 300 is the priority value
    'myFirstSpider.pipelines.DoubanPipeline': 301,  # the larger the number, the lower the priority (it runs later)
}
```
2. Create a project
On the command line, enter:
```
(scrapy_) D:\programme\Python\scrapy_>scrapy startproject myFirstSpider
(scrapy_) D:\programme\Python\scrapy_>cd myFirstSpider
(scrapy_) D:\programme\Python\scrapy_\myFirstSpider>scrapy genspider douban "douban.com"
```
3. Define data
Define an extracted structured data (Item)
- Open items.py in the myFirstSpider directory
- An item defines the structured data fields used to store the crawled data. It is a bit like a Python dictionary, but it provides some extra protection against errors such as typos in field names
- You create a class that inherits from scrapy.Item and define class attributes of type scrapy.Field to define an item (this can be understood as a mapping similar to an ORM)
- Next, create a DoubanItem class and build the item model
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class MyfirstspiderItem(scrapy.Item):
    # You can create a class yourself, but it must inherit from scrapy.Item
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class DoubanItem(scrapy.Item):
    title = scrapy.Field()      # title
    introduce = scrapy.Field()  # introduction
```
4. Write and extract data
Write a Spider to crawl the website and extract structured data (items)
Write the following in the generated spider file (douban.py):
```python
import scrapy
from ..items import DoubanItem  # Import the defined structured data


class DoubanSpider(scrapy.Spider):
    name = 'douban'  # The unique, distinguishing name of the spider
    # allowed_domains = ['douban.com']  # Allowed crawling range
    # start_urls = ['http://douban.com/']  # initial crawl url
    start_urls = ['https://movie.douban.com/top250']  # you can define the url to crawl

    def parse(self, response):
        info = response.xpath('//div[@class="info"]')
        for i in info:
            # Collect film information
            item = DoubanItem()
            # Get the first match; extract_first() / extract() pull the text out of the selector objects
            title = i.xpath("./div[1]/a/span[1]/text()").extract_first()
            introduce = i.xpath("./div[2]/p[1]//text()").extract()  # get all matching text
            # Clean up the text
            introduce = "".join(part.replace("\xa0", "").strip() for part in introduce)
            item["title"] = title
            item["introduce"] = introduce
            # Hand the collected data to the pipeline
            yield item
```
5. Store data
Write Item Pipelines to store the extracted items (i.e. structured data)
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MyfirstspiderPipeline:
    def process_item(self, item, spider):
        return item


class DoubanPipeline:
    # Runs once when the spider starts
    def open_spider(self, spider):
        if spider.name == "douban":  # If the data comes from the douban spider
            print("The crawler is running!")
            self.fp = open("./douban.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        if spider.name == "douban":
            # Save to file
            self.fp.write(f"title: {item['title']}, information: {item['introduce']}\n")
        return item

    # Runs once when the spider closes
    def close_spider(self, spider):
        if spider.name == "douban":
            print("Crawler finished running!")
            self.fp.close()
```
6. Run file
(scrapy_) D:\programme\Python\scrapy_\myFirstSpider>scrapy crawl douban
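Besides the command line, the same spider can be started from a short Python script with Scrapy's CrawlerProcess. A minimal sketch, assuming it is saved as run.py in the myFirstSpider project directory (next to scrapy.cfg):

```python
# run.py - launch the 'douban' spider programmatically (sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())  # load this project's settings.py
    process.crawl("douban")  # the spider's `name` attribute
    process.start()          # blocks until crawling finishes
```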
3, Log printing
1. Log information
Log information level:
- ERROR: error messages
- WARNING: warnings
- INFO: general information
- DEBUG: debugging information
Set the output of log information in settings.py:
LOG_LEVEL = "ERROR" # Specify the type of log information LOG_FILE = "log.txt" # Indicates that the log information is written to the specified file for storage
2. logging module
```python
import logging

logger = logging.getLogger(__name__)  # __name__ is the name of the current module
logger.warning("info")  # Print the log message to be output
```
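Inside a spider you do not even need to create a logger yourself: every Spider exposes a built-in self.logger named after the spider. A small sketch (the spider name and URL below are placeholders):

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # self.logger is a standard logging.Logger, so the usual log levels apply
        self.logger.warning("Parsed %s with status %s", response.url, response.status)
```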
4, Whole-site crawling
1. Queuing follow-up requests with scrapy.Request
yield scrapy.Request(url=new_url, callback=self.parse_taoche, meta={"page": page})
Parameters:
- url: the address to request
- callback: the function that processes the response returned for this request
- meta: used to pass data along with the request
  - Each request carries its own meta parameter
  - It is passed on to the response
  - You can read it with response.meta or response.meta["page"]
```python
import logging

import scrapy
from ..items import DetailItem

logger = logging.getLogger(__name__)


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    # allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        print(response)
        info = response.xpath('//div[@class="info"]')
        for i in info:
            item_detail = DetailItem()  # Item for the detail page
            # Collect film information
            title = i.xpath("./div[1]/a/span[1]/text()").extract_first()
            item_detail["title"] = title
            logger.warning(title)
            detail_url = i.xpath("./div[1]/a/@href").extract_first()  # Get the url of the detail page
            # print(detail_url)
            # Hand the request to the scheduler; the item travels along in meta
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item_detail})

        # Get the url of the next page
        next_url = response.xpath("//div[@class='paginator']/span[3]/a/@href").extract_first()
        if next_url:
            next_url = "https://movie.douban.com/top250" + next_url
            # print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)  # Hand the request to the scheduler

    def parse_detail(self, resp):
        item = resp.meta["item"]  # Receive the structured data
        # Get the introduction
        introduce = resp.xpath("//div[@id='link-report']/span[1]/span//text()").extract()
        item["introduce"] = introduce
        logger.warning(introduce)
        # Get the comments
        content = resp.xpath("//div[@id='hot-comments']/div[1]//text()").extract()
        item["content"] = content
        logger.warning(content)
        yield item
```
2. Inheriting CrawlSpider
There are two types of crawlers in the Scrapy framework:
- Spider
- CrawlSpider:
- CrawlSpider is a derived class of Spider. The Spider class is designed to crawl only the pages in the start_urls list, while CrawlSpider defines rules that provide a convenient mechanism for following links: it extracts links from the crawled pages and continues crawling them, which is more suitable for whole-site crawling
Creation method:
scrapy genspider -t crawl <spider name> <domain>
When created, the file looks like this:
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FuhaoSpider(CrawlSpider):
    name = 'fuhao'
    # allowed_domains = ['fuhao.com']
    start_urls = ['https://www.phb123.com/renwu/fuhao/shishi_1.html']

    rules = (
        Rule(
            LinkExtractor(allow=r'shishi_\d+.html'),  # The link extractor extracts urls matching the regex
            callback='parse_item',                    # Specify the callback function
            follow=True                               # Whether responses from the extracted links also go through the rules again
        ),
    )

    def parse_item(self, response):
        print(response.request.url)
```
Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True):
- LinkExtractor: a link extractor that extracts url addresses matching the regular expression
- callback: a request is sent to each extracted url, and the resulting response object is processed by the function specified in callback
- follow: whether the responses obtained from the extracted links go through the rules again to extract more url addresses
```python
# Matching Douban
start_urls = ['https://movie.douban.com/top250?start=0&filter=']

rules = (
    Rule(LinkExtractor(allow=r'\?start=\d+&filter='), callback='parse_item', follow=True),
)
```
5, Binary file
1. Picture download
ImagesPipeline: Scrapy's built-in pipeline for downloading images
In pipelines.py, write the following code (assuming the download address of the picture is passed in the item):
```python
import logging

import scrapy
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline


# Inherit from ImagesPipeline
class PicPipeLine(ImagesPipeline):
    # Initiate a request for each picture address
    def get_media_requests(self, item, info):
        src = item["src"]  # item["src"] stores the address of the picture
        logging.warning("Accessing picture: %s", src)
        yield scrapy.Request(url=src, meta={'item': item})  # Request the picture

    # Specify the name of the saved picture
    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']  # Receive the meta parameter
        return request.url.split("/")[-1]  # Use the last part of the url as the file name

    # Return the item to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item

# In settings.py, set IMAGES_STORE = "./imags" to choose the folder where pictures are saved
```
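For completeness, a sketch of the related entries in settings.py; the module path assumes the pipeline above lives in this project's pipelines.py, so adjust it to your own project name:

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    'myFirstSpider.pipelines.PicPipeLine': 300,
}
IMAGES_STORE = "./imags"  # folder where downloaded pictures are saved
```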
6, Middlewares
1. Downloader Middleware
Downloader middleware can replace the proxy IP, cookies and user agent, and retry requests automatically.
In settings.py:
```python
# Build an ip pool
PROXY_LIST = []
```
In middlewares.py:
```python
import random
import re
import time

from fake_useragent import UserAgent
from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By


class Spider4DownloaderMiddleware:
    # Intercepts every request
    def process_request(self, request, spider):
        # UA camouflage
        request.headers["User-Agent"] = UserAgent().random
        return None

    # Intercepts responses; the response information can be tampered with here
    def process_response(self, request, response, spider):
        bro = spider.bro  # a selenium browser instance attached to the spider
        if request.url in spider.model_urls:
            # print(request.url)
            # Tamper with the response object of this request
            bro.get(request.url)
            # Execute js code to scroll to the bottom
            bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            # After scrolling once, the scroll bar may still be in the middle
            bottom = []  # An empty list means the bottom has not been reached yet
            while not bottom:  # bool([]) ==> False, not False ==> True
                bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
                page_text = bro.page_source  # Get the page content
                # If the "end of list" marker appears, the loop ends
                bottom = re.findall(
                    r'<div class="load_more_tip" style="display: block;">:-\)It\'s the end~</div>',
                    page_text)
                time.sleep(1)
                if not bottom:
                    try:
                        # Find the "load more" button and click it
                        bro.find_element(By.CSS_SELECTOR, '.load_more_btn').click()
                    except Exception:
                        bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            return HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
        return response

    # Handle exceptions; executed when the network request fails
    def process_exception(self, request, exception, spider):
        # Add a proxy ip
        type_ = request.url.split(":")[0]
        request.meta['proxy'] = f"{type_}://{random.choice(spider.settings.get('PROXY_LIST'))}"
        return request  # If the ip is blocked, resend the request through the proxy ip

    # Executed when the spider starts
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```
After writing the downloader middleware, enable it in the settings configuration file.
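A minimal sketch of what that looks like; the module path assumes the class above sits in your project's middlewares.py, so replace yourproject with your actual project name:

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.Spider4DownloaderMiddleware': 543,
}
```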
2. Spider Middleware
The usage of spider middleware is very similar to that of downloader middleware, but they act on different objects: downloader middleware acts on the request and the returned response, while spider middleware acts on the spiders, more specifically on each file written under the spiders folder.
- When the spider runs yield scrapy.Request() or yield item, the spider middleware's process_spider_output() method is called
- When an exception occurs in the spider's own code, the spider middleware's process_spider_exception() method is called
- Before a callback function parse_xxx() in the spider is called, the spider middleware's process_spider_input() method is called
- When start_requests() runs, the spider middleware's process_start_requests() method is called
```python
import scrapy


class Spider5SpiderMiddleware:
    # Called right before the response enters the spider callback parse_xxx(),
    # i.e. after the downloader middleware has finished processing
    def process_spider_input(self, response, spider):
        return None

    # Called when the spider runs yield item or yield scrapy.Request()
    def process_spider_output(self, response, result, spider):
        for item in result:
            print(item)
            if isinstance(item, scrapy.Item):
                # Here you can perform various operations on the items about to be submitted to the pipeline
                print('item will be submitted to the pipeline')
            # `item` may also be a Request; in that case you can modify the request information here, such as meta
            yield item

    # Called when an error is raised while the spider is running
    def process_spider_exception(self, response, exception, spider):
        """
        If a parameter error is found in the spider, you can use the raise keyword to throw a
        custom exception manually. In real crawler development you can deliberately skip
        try ... except in some places and let the exception propagate. For example, read the
        result of an XPath match directly without first checking whether the list is empty;
        if the list is empty, an IndexError is raised and the flow enters the spider
        middleware's process_spider_exception().
        """
        print("Error on page %s, error message: %s" % (response.meta["page"], exception))
        # Here you can capture the exception information or return a value

    # Called when the spider runs start_requests()
    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            print(r.url)  # r is a Request object
            yield r

    # Called when the spider starts
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```
Note: enable the spider middleware in the settings configuration file.
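A sketch of the corresponding settings entry, with the same caveat about the module path:

```python
# settings.py (sketch)
SPIDER_MIDDLEWARES = {
    'yourproject.middlewares.Spider5SpiderMiddleware': 543,
}
```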
7, Simulate login
1. Cookie
Before the whole framework starts working, it needs a starting condition: start_urls. Only after requests to the start_urls pages have been issued do the scheduler, downloader, spider and pipeline start to operate. So we can override the start_requests method, which issues the requests for start_urls, and carry our cookies in it.
Note: the requests must be returned with yield, otherwise it will not run.
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    # allowed_domains = ['example.com']
    start_urls = ['https://www.baidu.com']

    # Override the start_requests method; Scrapy starts from here
    def start_requests(self):
        # The first way to add cookies: turn the cookie string into a dict
        cookie = " "  # paste your cookie string here
        cookie_dic = {}
        for i in cookie.split(";"):
            cookie_dic[i.split("=")[0]] = i.split("=")[1]

        # The second way to add cookies: put them in the headers
        headers = {
            "cookie": "cookie_info",
            # When passing cookies through headers, check the COOKIES_ENABLED setting in settings.py,
            # otherwise the cookies middleware may override the header
        }
        for url in self.start_urls:
            # Add the cookies (for the first way, pass cookies=cookie_dic instead)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        print(response.text)
```
2. Direct login
Simulate login by passing parameters and accessing interfaces:
How to use the first method:
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    # allowed_domains = ['example.com']
    start_urls = ['https://github.com']

    def parse(self, response):
        # Fill in the login parameters (there may be many of them)
        post_data = {
            "username": "lzk",
            "password": "123456",
            "time": "123",
            "sad": "asdsad12",
        }
        # Submit the login parameters to the server to verify the login
        # Method 1
        yield scrapy.FormRequest(
            url='https://github.com/session',
            formdata=post_data,
            callback=self.parse_login,
        )

    def parse_login(self, response):
        print(response.text)
```
How to use the second method:
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy import FormRequest, Request


class ExampleLoginSpider(scrapy.Spider):
    name = "login_"
    # allowed_domains = ["example.webscraping.com"]
    start_urls = ['http://example.webscraping.com/user/profile']
    login_url = 'http://example.webscraping.com/places/default/user/login'

    # Override the start_requests method to log in first
    def start_requests(self):
        yield scrapy.Request(
            self.login_url,
            callback=self.login
        )

    def login(self, response):
        formdata = {
            'email': 'liushuo@webscraping.com',
            'password': '12345678'
        }
        # Build the POST request from the login form in the response
        yield FormRequest.from_response(
            response,
            formdata=formdata,
            callback=self.parse_login
        )

    def parse_login(self, response):
        if 'Welcome Liu' in response.text:
            # After logging in, issue the original start_urls requests, which are handled by parse()
            yield from super().start_requests()

    def parse(self, response):
        print(response.text)
```
8, Distributed crawler
1. Concept
Concept:
- Multiple machines jointly crawl one project in a distributed way
Effect:
- More working units, higher crawling efficiency
Realization:
- Multiple machines share one scheduler
- Implement a public scheduler
  - First, every machine must be able to connect to it; second, it must be able to store the urls we crawl, i.e. provide the storage function of a database, so redis is used
  - The url can be sent from the spider to the engine and from the engine to redis
  - The url can also be sent from the scheduler to redis
  - Similarly, for persistent storage, the pipeline can hand the item data over to redis
- Install
  - pip install scrapy-redis -i https://pypi.com/simple
2. Usage
Add the following to the settings configuration file:
```python
# Use the pipeline defined by scrapy_redis; it can be called directly
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Specify the redis address
REDIS_HOST = '192.168.45.132'  # redis server address, the virtual machine we use
REDIS_PORT = 6379              # redis port

# Use the scrapy_redis scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# De-duplication container class: a redis set stores the request fingerprints, making de-duplication persistent
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Whether the scheduler is persistent: whether to keep the request queue and fingerprint set in redis when the crawler ends; set to True for persistence
SCHEDULER_PERSIST = True
```
Add the following to the crawler file:
```python
import scrapy
from ..items import TaoCheItem
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


# Note: if you are using scrapy.Spider, inherit RedisSpider when switching to redis distributed crawling;
# if you are using CrawlSpider, inherit RedisCrawlSpider
class TaocheSpider(RedisCrawlSpider):
    name = 'taoche'
    # allowed_domains = ['taoche.com']
    # start_urls = ['https://changsha.taoche.com/bmw/?page=1']  # the starting url is obtained from redis (the public scheduler) instead
    redis_key = 'taoche'  # Fetch the start urls from redis under the key 'taoche'

    rules = (
        Rule(LinkExtractor(allow=r'/\?page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        car_list = response.xpath('//div[@id="container_base"]/ul/li')
        for car in car_list:
            lazyimg = car.xpath('./div[1]/div/a/img/@src').extract_first()
            lazyimg = 'https:' + lazyimg
            title = car.xpath('./div[2]/a/span/text()').extract_first()
            resisted_date = car.xpath('./div[2]/p/i[1]/text()').extract_first()
            mileage = car.xpath('./div[2]/p/i[2]/text()').extract_first()
            city = car.xpath('./div[2]/p/i[3]/text()').extract_first().replace('\n', '').strip()
            price = car.xpath('./div[2]/div[1]/i[1]//text()').extract()
            price = ''.join(price)
            sail_price = car.xpath('./div[2]/div[1]/i[2]/text()').extract_first()
            print(lazyimg, title, resisted_date, mileage, city, price, sail_price)
```
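To actually start the distributed crawl, push a start url into redis under the redis_key defined above. A sketch with redis-cli, reusing the redis address from the settings example and the commented-out taoche url (both are placeholders for your own environment):

```
redis-cli -h 192.168.45.132 -p 6379 lpush taoche "https://changsha.taoche.com/bmw/?page=1"
```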