Python crawler (3): downloading files and pictures, and downloader middleware

Catalogue of series articles

python crawler directory

preface

These notes are excerpted from the corresponding Bilibili course:
Worthy of a Tsinghua expert! Python web crawlers made simple and clear: a nanny-level tutorial from beginner to master (collection recommended)

The following is the main content of this article; the cases below can be used as a reference.

1, Download files and pictures

Scrapy provides reusable item pipelines for downloading files attached to an item (for example, when crawling products you may also want to save the corresponding pictures). These pipelines share some common logic and structure (we call them media pipelines). Generally, you will use the Files Pipeline or the Images Pipeline.

1. Why use Scrapy's built-in methods to download files

1. Avoid downloading files that have been downloaded recently.
2. You can easily specify the path of file storage.
3. You can convert the downloaded pictures into a common format. For example, png, jpg
4. Thumbnails can be easily generated
5. It is convenient to detect the width and height of pictures to ensure that they meet the minimum limit.
6. Asynchronous download, very efficient.

2. Files Pipeline for downloading files

When downloading files with the Files Pipeline, follow these steps (a minimal sketch follows this list):
1. Define an item with two attributes, file_urls and files. file_urls stores the URLs of the files to download and must be a list.
2. After the download finishes, information about the downloaded files is stored in the item's files attribute, such as the download path, the original download URL, and the file checksum.
3. Configure FILES_STORE in settings.py; this setting specifies the directory where downloaded files are stored.
4. Enable the pipeline: add 'scrapy.pipelines.files.FilesPipeline': 1 to ITEM_PIPELINES in settings.py.
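A minimal sketch of these steps (the project layout, item class name, and storage path are illustrative assumptions; only the two field names and the two settings are required by the Files Pipeline):

items.py

import scrapy

class FileItem(scrapy.Item):
    file_urls = scrapy.Field()   # list of file URLs to download (required field name)
    files = scrapy.Field()       # filled in by the pipeline after download (required field name)

settings.py

FILES_STORE = "downloads"        # directory where downloaded files are saved

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}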

3. Images Pipeline for downloading images:

When downloading images with the Images Pipeline, follow these steps:
1. Define an item with two attributes, image_urls and images. image_urls stores the URLs of the images to download and must be a list.
2. After the download finishes, information about the downloaded images is stored in the item's images attribute, such as the download path, the original download URL, and the image checksum.
3. Configure IMAGES_STORE in settings.py; this setting specifies the directory where downloaded images are stored.
4. Enable the pipeline: add 'scrapy.pipelines.images.ImagesPipeline': 1 to ITEM_PIPELINES in settings.py.

4. Practice: downloading Autohome CR-V pictures

Download the CR-V pictures by overriding ImagesPipeline. The code is as follows:
settings.py

import os
 
BOT_NAME = 'crv'
 
SPIDER_MODULES = ['crv.spiders']
NEWSPIDER_MODULE = 'crv.spiders'
 
ROBOTSTXT_OBEY = False
 
DOWNLOAD_DELAY = 1
 
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
}
 
ITEM_PIPELINES = {
    'crv.pipelines.CrvImagesPipeline': 1,
}
 
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "images")

items.py

import scrapy
 
class CrvItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

pipelines.py

import os
from crv import settings
from scrapy.pipelines.images import ImagesPipeline


class CrvImagesPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        # Called when an image is about to be stored, to determine its storage path.
        category = item["title"]
        category_path = os.path.join(settings.IMAGES_STORE, category)
        image_name = request.url.split("_")[-1]
        image_path = os.path.join(category_path, image_name)
        print("image_path: {}".format(image_path))
        return image_path

crv_spider.py

import scrapy
from crv.items import CrvItem
 
class CrvSpiderSpider(scrapy.Spider):
    name = 'crv_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series-s46246/314.html#pvareaid=3454542']
 
    def parse(self, response):
        divs = response.xpath("//div[@class='column grid-16']/div")[2:]  # skip the first two divs (page header rows)
        for div in divs:
            print(div.xpath(".//div[@class='uibox-title']"))
            title = div.xpath(".//div[@class='uibox-title']/a/text()").get()
            urls = div.xpath(".//ul/li/a/img/@src").getall()
            urls = list(map(response.urljoin, urls))
            item = CrvItem(title=title, image_urls=urls)
            print(item)
            yield item
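To run the example: scrapy crawl crv_spider. With the file_path() override above, each image should end up under images/<category title>/<image name>, where the images directory comes from IMAGES_STORE in settings.py.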

2, Downloader Middleware

Downloader middleware sits between the engine and the downloader. In this middleware we can set proxies and change request headers to help evade anti-crawling measures. To write a downloader middleware, you can implement two methods: process_request(self, request, spider), which is executed before the request is sent to the downloader, and process_response(self, request, response, spider), which is executed before the downloaded response is returned to the engine.

1,process_request(self, request, spider)

This method is executed before the downloader sends the request. A random proxy IP, for example, is typically set in this method.
1. Parameters:
Request: the request object being sent
Spider: the spider object that issued the request

2. Return values (a sketch follows below):
Return None: Scrapy continues processing this request, executing the corresponding methods of the other middlewares until the appropriate downloader handler is called.
Return a Response object: Scrapy will not call any other process_request method and will return this response object directly; the process_response() methods of the installed middlewares are then called as the response is returned.
Return a Request object: the original request is no longer used to download data; the returned request object is scheduled for download instead.
If an exception is raised in this method, the process_exception() method is called.
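As a hedged illustration of these return values, here is a minimal downloader middleware sketch (the class name and the in-memory cache are made up for this example) that would live in middlewares.py: it returns a cached Response when one is available and None otherwise.

from scrapy.http import HtmlResponse

class CachedResponseDownloaderMiddleware:
    """Illustrates the return values of process_request with a tiny in-memory cache."""

    cache = {}  # url -> response body (bytes); how it gets filled is out of scope here

    def process_request(self, request, spider):
        body = self.cache.get(request.url)
        if body is not None:
            # Returning a Response: the downloader is skipped and the
            # process_response() methods of the installed middlewares run next.
            return HtmlResponse(url=request.url, body=body, encoding="utf-8", request=request)
        # Returning None: Scrapy continues processing the request through the
        # remaining middlewares and eventually downloads it.
        return None

Like any other downloader middleware, this would be enabled in DOWNLOADER_MIDDLEWARES in settings.py.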

2,process_response(self, request, response, spider)

This method is executed after the downloader has downloaded the data and before the response is passed back to the engine.
1. Parameters:
Request: the request object
Response: the response object to be processed
Spider: the spider object

2. Return values (a sketch follows below):
Return a Response object: the new response object is passed on to the other middlewares and finally to the crawler.
Return a Request object: the middleware chain is cut off, and the returned request is rescheduled for download.
If an exception is raised, the errback method of the request is called; if no errback is specified, the exception is raised.
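A minimal sketch of these return values (the class name is made up, and a real retry middleware should also cap the number of retries, which is omitted here for brevity): re-request anything that does not come back with HTTP 200, otherwise pass the response through unchanged.

class RetryNon200DownloaderMiddleware:
    """Illustrates the return values of process_response."""

    def process_response(self, request, response, spider):
        if response.status != 200:
            # Returning a Request: this response is discarded and the
            # returned request is rescheduled for download.
            return request.replace(dont_filter=True)
        # Returning a Response: it is passed on to the remaining
        # middlewares and finally to the spider callback.
        return response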

3. Random request header Middleware

When a crawler accesses a page frequently with the same request headers, the server can easily detect it and block further requests. So we should randomly change the request header before each visit to keep the crawler from being caught.
Randomly changing the request header can be implemented in the downloader middleware: pick a request header at random before the request is sent to the server, instead of always using the same one. The example code is as follows:
settings.py

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
   'useragent.middlewares.UseragentDownloaderMiddleware': 543,
}

middlewares.py

import random


# Randomly select a request header for each outgoing request
class UseragentDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
 
    USER_AGENT = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
    ]
 
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
 
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
 
        user_agent = random.choice(self.USER_AGENT)
        request.headers["User-Agent"] = user_agent
        return None

httpbin.py

# Request httpbin.org/user-agent and print the User-Agent it reports
import scrapy
 
class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']
 
    def parse(self, response):
        ua = response.json()
        print("*" * 30)
        print(ua)
        print("*" * 30)
        # Request the same page again (dont_filter bypasses the duplicate filter)
        # so a different random User-Agent can be observed on each response.
        return scrapy.Request(self.start_urls[0], dont_filter=True)

4. IP proxy pool Middleware

1. Purchasing proxies

  1. Zhima proxy: https://zhimahttp.com/
  2. Taiyang proxy: http://http.taiyangruanjian.com/
  3. Kuaidaili: https://www.kuaidaili.com/
  4. Jahttp (Zhima Ruanjian): https://jahttp.zhimaruanjian.com/

2. Using an IP proxy pool

settings.py

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
   'useragent.middlewares.IpProxyDownloaderMiddleware': 543,
}

middlewares.py

import random


# Randomly select a proxy for each outgoing request
class IpProxyDownloaderMiddleware(object):

    IPPROXY = [
        "http://101.18.121.42:9999",
        "http://175.44.108.56:9999",
        "http://218.88.205.161:3256",
        "http://114.99.9.251:1133",
    ]

    def process_request(self, request, spider):
        proxy = random.choice(self.IPPROXY)
        request.meta['proxy'] = proxy

ipproxy.py

# Request httpbin.org/ip and print the origin IP it reports
import scrapy
 
class IpproxySpider(scrapy.Spider):
    name = 'ipproxy'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']
 
    def parse(self, response):
        ua = response.json()
        print("*" * 30)
        print(ua)
        print("*" * 30)
        # Request the same page again so a different random proxy is used each time
        return scrapy.Request(self.start_urls[0], dont_filter=True)
# Free proxies are used here; they are not very stable, so the request may not return a result

3. Dedicated (paid) proxies

Testing with Kuaidaili (a paid proxy provider):

import base64


class IPProxyDownloadMiddleware(object):
    def process_request(self, request, spider):
        proxy = 'http://111.111.222.222:8080'     # proxy address (placeholder)
        user_password = "123123123:kod231"        # "username:password" (placeholder)
        request.meta["proxy"] = proxy
        # Proxy-Authorization carries HTTP Basic credentials: base64 of "username:password"
        b64_user_password = base64.b64encode(user_password.encode('utf-8'))
        request.headers["Proxy-Authorization"] = "Basic " + b64_user_password.decode('utf-8')

