Scrapy framework in Python

1, Introduction

1. Introduction

Scrapy is a crawler framework written in pure Python and built on Twisted for asynchronous processing. The Scrapy framework is widely used for data acquisition, network monitoring, automated testing, and more.

2. Environment configuration

  1. Install pywin32
    • pip install pywin32
  2. Install wheel
    • pip install wheel
  3. Install twisted
    • pip install twisted
  4. Install the Scrapy framework
    • pip install scrapy
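
Assuming the installation succeeded, it can be checked from the command line (the version printed will vary):

scrapy version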

3. Common commands

command      | format                                             | explanation
startproject | scrapy startproject <project name>                 | Create a new project
genspider    | scrapy genspider <crawler file name> <domain name> | Create a new crawler file
runspider    | scrapy runspider <crawler file>                    | Run a crawler file without creating a project
crawl        | scrapy crawl <spider name>                         | Run a crawler in a project; the project must exist
list         | scrapy list                                        | List all crawler files in the project
view         | scrapy view <URL>                                  | Open the URL in a browser
shell        | scrapy shell <URL>                                 | Command-line interaction mode
settings     | scrapy settings                                    | View the configuration of the current project

4. Operating principle

4.1 Flow chart

(The architecture flow chart is omitted here; it shows the data flow between the engine, scheduler, downloader, spiders and item pipeline described below.)

4.2 component introduction

  1. Engine

    The engine is responsible for controlling the data flow between all components of the system and triggering events when certain actions occur.

  2. Scheduler

    The scheduler accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs.

  3. Downloader

    The downloader fetches web content and returns it to the engine. It is built on Twisted's efficient asynchronous model.

  4. Spiders

    A spider is a class customized by the developer. It is responsible for processing all responses, analyzing and extracting data from them, filling the fields required by the Item, and submitting follow-up URLs to the engine so that they enter the scheduler again.

  5. Item pipeline

    After items are extracted, the item pipeline is responsible for processing them, mainly cleaning, validation, and persistence (such as saving to a database).

  6. Downloader Middleware

    It can be thought of as a component for customizing and extending the download functionality.

  7. Spider Middleware

    Located between the engine and the spiders, it mainly handles the spiders' input (i.e. responses) and output (i.e. requests).

4.3 operation process

  1. Engine: Hi! Spider, which website do you want to deal with?
  2. Spider: The boss wants me to deal with xxxx.com.
  3. Engine: Give me the first URL that needs to be processed.
  4. Spider: Here you are. The first URL is xxxxxxxx.com.
  5. Engine: Hi! Scheduler, I have a request here. Please help me sort it and put it in the queue.
  6. Scheduler: OK, I'm processing it. Wait a minute.
  7. Engine: Hi! Scheduler, give me a request you have processed.
  8. Scheduler: Here you are. This is a request I have processed.
  9. Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.
  10. Downloader: OK! Here you are, the download went well. (If it fails: sorry, this request failed to download. The engine then tells the scheduler that the request failed, please record it and we'll download it again later.)
  11. Engine: Hi! Spider, here is something that has been downloaded and already processed according to the boss's downloader middleware. Please handle it yourself. (Note: the responses here are handled by the parse() function by default.)
  12. Spider: (after processing the data, for the URLs that need to be followed up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the item data I obtained.
  13. Engine: Hi! Pipeline, I have an item here, please help me deal with it! Scheduler, this is a URL that needs to be followed up, please help me deal with it. Then the cycle starts again from step 4 until all the information the boss needs has been obtained.
  14. Pipeline and scheduler: OK, doing it now!

Note: the whole program stops only when the scheduler has no requests left to process. (Scrapy will also re-download URLs that previously failed to download.)

2, Create project

This example crawls Douban (douban.com).

1. Modify configuration

Add or modify the following in settings.py:

LOG_LEVEL = "WARNING"  # Set log level
from fake_useragent import UserAgent
USER_AGENT = UserAgent().random  # Set request header
ROBOTSTXT_OBEY = False  # Whether to comply with robots protocol. The default value is True
ITEM_PIPELINES = {  # Enable the pipelines
    'myFirstSpider.pipelines.MyfirstspiderPipeline': 300,  # 300 is the priority value
    'myFirstSpider.pipelines.DoubanPipeline': 301,  # the smaller the number, the earlier the pipeline runs
}
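
The fake_useragent import above comes from the third-party fake-useragent package; assuming it is not already installed, it can be added with:

pip install fake-useragent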

2. Create a project

On the command line, enter:

(scrapy_) D:\programme\Python\scrapy_>scrapy startproject myFirstSpider

(scrapy_) D:\programme\Python\scrapy_>cd myFirstSpider

(scrapy_) D:\programme\Python\scrapy_\myFirstSpider>scrapy genspider douban "douban.com"

3. Define data

Define the structured data (Item) to be extracted

  1. Open items.py in the myFirstSpider directory
  2. Item defines structured data fields to store the crawled data. It is a bit like a Python dictionary, but it provides some extra protection to reduce errors
  3. Create a class that inherits scrapy.Item and define scrapy.Field class attributes to define an Item (this can be understood as a mapping relationship similar to an ORM)
  4. Next, create a DoubanItem class and build the item model, as in the code below
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyfirstspiderItem(scrapy.Item):  # You can create your own class, but it must inherit scrapy.Item
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class DoubanItem(scrapy.Item):
    title = scrapy.Field()  # title
    introduce = scrapy.Field()  # introduce

4. Write and extract data

Write a Spider to crawl the website and extract structured data (items)

Write the following in the crawler file created earlier:

import scrapy
from ..items import DoubanItem  # Import defined formatted data


class DoubanSpider(scrapy.Spider):
    name = 'douban'  # The unique name of the crawler
    # allowed_domains = ['douban.com']  # Allowed crawling range
    # start_urls = ['http://douban.com/']  # initial crawl url
    start_urls = ['https://movie.douban.com/top250']  # you can define the url to crawl

    def parse(self, response):
        info = response.xpath('//div[@class="info"]')
        for i in info:
            # Film information collection
            item = DoubanItem()
            title = i.xpath("./div[1]/a/span[1]/text()").extract_first()  # Get the first match; extract() pulls the text out of the selector object
            introduce = i.xpath("./div[2]/p[1]//text()").extract()  # get all the text content
            introduce = "".join(j.strip() for j in [k.replace("\xa0", '') for k in introduce])  # Clean up the information
            item["title"] = title
            item["introduce"] = introduce

            # Give the obtained data to pipeline
            yield item

5. Store data

Write Item Pipelines to store the extracted items (i.e. structured data)

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MyfirstspiderPipeline:
    def process_item(self, item, spider):
        return item


class DoubanPipeline:
    # This function runs when the crawler starts
    def open_spider(self, spider):
        if spider.name == "douban":  # If the data came from the douban crawler
            print("The crawler is running!")
            self.fp = open("./douban.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        if spider.name == "douban":
            self.fp.write(f"title:{item['title']}, Information:{item['introduce']}\n")  # Save to file
        return item  # pass the item on to any later pipelines

    # Run at the end of the crawler
    def close_spider(self, spider):
        if spider.name == "douban":
            print("Crawler finished running!")
            self.fp.close()

6. Run file

(scrapy_) D:\programme\Python\scrapy_\myFirstSpider>scrapy crawl douban 

3, Log printing

1. Log information

Log information level:

  • ERROR: error message
  • WARNING: warning message
  • INFO: general information
  • DEBUG: debug information

Set the output of log information in settings.py:

LOG_LEVEL = "ERROR"  # Specify the type of log information
LOG_FILE = "log.txt"  # Indicates that the log information is written to the specified file for storage

2. logging module

import logging
logger = logging.getLogger(__name__)  # __name__ is the module name of the current file
logger.warning(" info ")  # Print the log message to be output

4, Whole-site crawling

1. Queue follow-up requests with scrapy.Request

yield scrapy.Request(url=new_url, callback=self.parse_taoche, meta={"page": page})

Parameters:

  • url: the address to request
  • callback: the function that processes the response to this request
  • meta: passes data along with the request
    • Each request carries a meta parameter
    • It is passed on to the response
    • It can be read with response.meta or response.meta["page"]
import scrapy, logging
from ..items import DetailItem

logger = logging.getLogger(__name__)


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    # allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        print(response)
        info = response.xpath('//div[@class="info"]')
        for i in info:
            item_detail = DetailItem()  # Contents of the details page
            # Film information collection
            title = i.xpath("./div[1]/a/span[1]/text()").extract_first()  # Get the first match; extract the text from the selector object
            item_detail["title"] = title
            logger.warning(title)

            detail_url = i.xpath("./div[1]/a/@href").extract_first()  # Get the url of the details page
            # print(detail_url)
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta=item_detail)  # Pass the request to the scheduler and re request

        next_url = response.xpath("//div[@class='paginator']/span[3]/a/@href").extract_first() # get the url of the next page
        if next_url:
            next_url = "https://movie.douban.com/top250" + next_url
            # print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse, )  # Pass the request to the scheduler and re request

    def parse_detail(self, resp):
        item = resp.meta  # Receive structured data

        introduce = resp.xpath("//div[@id='link-report']/span[1]/span//text()").extract()  # get the introduction
        item["introduce"] = introduce
        logger.warning(introduce)

        content = resp.xpath("//div[@id='hot-comments']/div[1]//text()").extract() # get comments
        item["content"] = content
        logger.warning(content)

        yield item
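
The DetailItem imported above is not defined earlier in this article; a minimal sketch of it in items.py, based on the three fields the spider fills in (title, introduce, content), could be:

import scrapy


class DetailItem(scrapy.Item):
    title = scrapy.Field()      # movie title from the list page
    introduce = scrapy.Field()  # introduction from the detail page
    content = scrapy.Field()    # hot comments from the detail page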

2. Inherit CrawlSpider

There are two types of crawlers in the Scrapy framework:

  • Spider
  • CrawlSpider:
    • CrawlSpider is a derived class of Spider. The Spider class is designed to crawl only the pages in the start_urls list, while CrawlSpider defines some rules that provide a convenient mechanism for following links: it extracts links from the crawled pages and continues to crawl them

Creation method:

scrapy genspider -t crawl <crawler name> <domain>

The newly created file looks like this:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FuhaoSpider(CrawlSpider):
    name = 'fuhao'
    # allowed_domains = ['fuhao.com']
    start_urls = ['https://www.phb123.com/renwu/fuhao/shishi_1.html']

    rules = (
        Rule(
            LinkExtractor(allow=r'shishi_\d+.html'),  # The link extractor extracts the url according to the regular rules
            callback='parse_item',  # Specify callback function
            follow=True  # Does the obtained response page go through rules again to extract the url address
        ),
    )
    

    def parse_item(self, response):
        print(response.request.url)

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True):

  • LinkExtractor: a link extractor that extracts url addresses according to the regular expression
  • callback: a request is sent to each extracted url address, and the resulting response is processed by the function specified by callback
  • follow: whether the response obtained is run through the rules again to extract more url addresses
# Matching Douban pagination
start_urls = ['https://movie.douban.com/top250?start=0&filter=']
rules = (
    Rule(LinkExtractor(allow=r'\?start=\d+&filter='), callback='parse_item', follow=True),
)

5, Binary files

1. Picture download

ImagesPipeline: the image download pipeline class

In pipelines.py, write the following code (it is assumed that the picture's download address is passed in the item):

import logging
import scrapy
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline


# Inherit ImagesPipeline
class PicPipeLine(ImagesPipeline):
    # Initiate a request for each picture address
    def get_media_requests(self, item, info):
        src = item["src"]  # item["src"] stores the address of the picture
        logging.warning("Accessing picture: %s", src)
        yield scrapy.Request(url=src, meta={'item': item})  # Request the picture

    # Specifies the name of the picture file
    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']  # Receive the meta parameter
        return request.url.split("/")[-1]  # Use the last part of the url as the file name
        # In settings.py set IMAGES_STORE = "./imags" to choose the folder where pictures are saved

    # Return data to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item  
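
A minimal sketch of the matching settings, assuming the pipeline above lives in myFirstSpider/pipelines.py (the folder name follows the comment above; ImagesPipeline also requires the Pillow library):

# settings.py (sketch)
ITEM_PIPELINES = {
    'myFirstSpider.pipelines.PicPipeLine': 300,
}
IMAGES_STORE = "./imags"  # folder where downloaded pictures are saved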

6, Middlewares

1. Downloader Middleware

Downloader middleware can replace the proxy IP, cookies and user agent, and retry requests automatically.

In settings.py:

# Establish ip pool
PROXY_LIST = []

In middlewares.py:

import random
import re
import time

from fake_useragent import UserAgent
from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By


class Spider4DownloaderMiddleware:
    # Intercepts every request
    def process_request(self, request, spider):
        # UA spoofing
        request.headers["User-Agent"] = UserAgent().random
        return None

    # Intercepts every response; the response information can be tampered with here
    def process_response(self, request, response, spider):
        bro = spider.bro
        if request.url in spider.model_urls:
            # print(request.url)
            # To tamper with the response object of the request request, response

            bro.get(request.url)

            # Execute js code
            bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            # After scrolling to the bottom, the scroll bar may still sit in the middle because more content keeps loading

            bottom = []  # An empty list indicates that there is no bottom
            while not bottom:  # bool([]) ==> false not false
                bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')

                page_text = bro.page_source  # Get page content
                # If the end-of-list tip is present, the loop ends
                bottom = re.findall(r'<div class="load_more_tip" style="display: block;">.*?</div>', page_text)  # this div becomes visible when the end of the list is reached
                time.sleep(1)

                if not bottom:
                    try:
                        bro.find_element(By.CSS_SELECTOR, '.load_more_btn').click()  # Find the "load more" button and click it
                    except Exception:
                        bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
                        
            return HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
        return response

    # Handle exceptions and execute this function when the network request fails
    def process_exception(self, request, exception, spider):
        # Add proxy ip
        type_ = request.url.split(":")[0]
        request.meta['proxy'] = f"{type_}://{random.choice(spider.settings.get('PROXY_LIST'))}"
        return request  # If the ip is blocked, the proxy ip is used to resend the request
	
    # Execute when starting crawler
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

After writing the downloader middleware, enable it in the settings configuration file, for example:
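
A minimal sketch, assuming the project package is called spider4 so that the class above sits in spider4/middlewares.py (the package name is illustrative):

# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'spider4.middlewares.Spider4DownloaderMiddleware': 543,
}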

2. Crawler Middleware

The usage of crawler (spider) middleware is very similar to that of downloader middleware, but they act on different objects. Downloader middleware acts on the request and the returned response; crawler middleware acts on the crawler itself, more specifically on each file written under the spiders folder.

  1. When the crawler runs yield scrapy.Request() or yield item, the spider middleware's process_spider_output() method is called
  2. When an exception occurs in the crawler's own code, the spider middleware's process_spider_exception() method is called
  3. Before a callback function parse_xxx() in the crawler is called, the spider middleware's process_spider_input() method is called
  4. When start_requests() runs, the spider middleware's process_start_requests() method is called
import scrapy


class Spider5SpiderMiddleware:
    # Called just before the response enters a callback function parse_xxx(), after the downloader middleware has finished processing
    def process_spider_input(self, response, spider):
        return None

    # Called when the crawler yields an item or a scrapy.Request()
    def process_spider_output(self, response, result, spider):
        for item in result:
            print(item)
            if isinstance(item, scrapy.Item):
                # Here you can perform various operations on the items about to be submitted to the pipeline
                print('The item will be submitted to the pipeline')
            yield item  # Requests are also yielded here; when the result is a request you can modify its information, such as meta

    # Called when an error is reported during the running of the crawler
    def process_spider_exception(self, response, exception, spider):
        """
        If a parameter error is found in the crawler, it will be used raise This keyword manually throws a custom exception. In actual crawler development, it can be deliberately not used in some places try ...
        except Instead of catching an exception, let the exception be thrown directly. for example XPath For the result of matching processing, directly read the value inside without first judging whether the list is empty. In this way, if the list is empty, it will be thrown a IndexError,
        So we can let the process of crawler enter the process of crawler middleware process_spider_exception()in
        """
        print("The first%s Page error, error message:%s" % response.meta["page"], exception)  # Here you can capture exception information or return value

    # Called when the crawler runs start_requests()
    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            print(r.url)
            yield r

    # Called when the crawler starts
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Note: the spider middleware must also be enabled in the settings configuration file, for example:
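
A minimal sketch, assuming the project package is called spider5 so that the class above sits in spider5/middlewares.py (the package name is illustrative):

# settings.py (sketch)
SPIDER_MIDDLEWARES = {
    'spider5.middlewares.Spider5SpiderMiddleware': 543,
}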

7, Simulate login

1. Cookie

Before the whole framework starts running, a start condition is required, namely start_urls. Only after requests have been made to the start_urls pages do the scheduler, downloader, crawler and pipeline start working. So we can override the start_requests method, which issues the requests for start_urls, and make it carry our cookies.

Note: the requests must be returned with yield, otherwise it will not run.

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    # allowed_domains = ['example.com']
    start_urls = ['https://www.baidu.com']

    # Override the start_requests method; Scrapy starts running from here
    def start_requests(self):
        # The first way to add cookies: parse the cookie string into a dict
        cookie = " "
        cookie_dic = {}
        for i in cookie.split(";"):
            cookie_dic[i.split("=")[0]] = i.split("=")[1]
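        # (sketch) cookie_dic could then be passed directly through the cookies
        # parameter: scrapy.Request(url=url, callback=self.parse, cookies=cookie_dic)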

        # The second way to add cookies: put them in the headers
        headers = {
            "cookie": "cookie_info",
            # When using headers to pass in cookies, add COOKIES_ENABLED = True in settings
        }
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)  # Add cookies

    def parse(self, response):
        print(response.text)

2. Direct login

Simulate login by passing parameters and accessing interfaces:

How to use the first method:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    # allowed_domains = ['example.com']
    start_urls = ['https://github.com']

    def parse(self, response):
        # Fill in a large number of login parameters
        post_data = {
            "username": "lzk",
            "password": "123456",
            "time": "123",
            "sad": "asdsad12",
        }
        # Pass the login parameters into the server to verify the login
        # Method 1
        yield scrapy.FormRequest(
            url='https://github.com/session',
            formdata=post_data,
            callback=self.parse_login,
        )

    def parse_login(self, response):
        print(response.text)

How to use the second method

# -*- coding: utf-8 -*-
import scrapy
from scrapy import FormRequest, Request


class ExampleLoginSpider(scrapy.Spider):
    name = "login_"
    # allowed_domains = ["example.webscraping.com"]
    start_urls = ['http://example.webscraping.com/user/profile']
    login_url = 'http://example.webscraping.com/places/default/user/login'

    def start_requests(self):
        # Override the start_requests method to log in first
        yield scrapy.Request(
            self.login_url,
            callback=self.login
        )

    def login(self, response):
        formdata = {
            'email': 'liushuo@webscraping.com',
            'password': '12345678'
        }
        yield FormRequest.from_response(
            response,
            formdata=formdata,
            callback=self.parse_login
        )
        
    def parse_login(self, response):
        if 'Welcome Liu' in response.text:
            yield from super().start_requests()  # Call the parent start_requests to visit the pages we actually want
            
    def parse(self, response):
        print(response.text)

8, Distributed crawler

1. Concept

Concept:

  • Multiple machines carry out distributed joint crawling for a project

Effect:

  • Add more working units and improve crawling efficiency

Realization:

  • Multiple machines share one scheduler
    • Implement a public scheduler
        • First, make sure every machine can connect to it; second, it must be able to store the crawled URLs, i.e. provide the storage function of a database. Redis is used for this
        • You can send the url from the crawler to the engine and the engine to redis
        • You can also send the url from the scheduler to redis
        • Similarly, in the persistent storage, the pipeline can also hand over the item data to redis for storage
      • install
        • pip install scrapy-redis -i https://pypi.com/simple

2. Usage

Add the following to the settings configuration file:

# Use the pipeline defined by scrapy_redis; it can be called directly
ITEM_PIPELINES = {
   'scrapy_redis.pipelines.RedisPipeline': 300,
}
# Specify redis address
REDIS_HOST = '192.168.45.132'  # redis server address, the virtual machine we use
REDIS_PORT = 6379  # redis port

# Use the scrapy_redis scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# Deduplication container class: a redis set is used to store request fingerprint data, making deduplication persistent
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Whether the scheduler persists its data: when the crawler ends, should the request queue and fingerprint set in Redis be kept? Set to True for persistence
SCHEDULER_PERSIST = True

Add the following to the crawler file:

import scrapy
from ..items import TaoCheItem
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


# Note: a crawler that previously inherited scrapy.Spider should inherit RedisSpider when using scrapy-redis distributed crawling
# If it was a CrawlSpider, it inherits RedisCrawlSpider instead
class TaocheSpider(RedisCrawlSpider):
    name = 'taoche'
    # allowed_domains = ['taoche.com']
    # start_urls = ['https://changsha.taoche.com/bmw/?page=1']  # the starting url is obtained from redis (the shared scheduler) instead

    redis_key = 'taoche'  # Read start urls from redis under the key 'taoche'
    rules = (
        Rule(LinkExtractor(allow=r'/\?page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        car_list = response.xpath('//div[@id="container_base"]/ul/li')
        for car in car_list:
            lazyimg = car.xpath('./div[1]/div/a/img/@src').extract_first()
            lazyimg = 'https:' + lazyimg
            title = car.xpath('./div[2]/a/span/text()').extract_first()
            resisted_date = car.xpath('./div[2]/p/i[1]/text()').extract_first()
            mileage = car.xpath('./div[2]/p/i[2]/text()').extract_first()
            city = car.xpath('./div[2]/p/i[3]/text()').extract_first().replace('\n', '').strip()
            price = car.xpath('./div[2]/div[1]/i[1]//text()').extract()
            price = ''.join(price)
            sail_price = car.xpath('./div[2]/div[1]/i[2]/text()').extract_first()
            print(lazyimg, title, resisted_date, mileage, city, price, sail_price)
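
With RedisCrawlSpider the start url is not hard-coded; it is pushed into Redis under the redis_key. A minimal sketch, run in redis-cli on the Redis server configured above and reusing the commented-out taoche url:

# on the Redis server (192.168.45.132 in the settings above)
redis-cli lpush taoche "https://changsha.taoche.com/bmw/?page=1"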
