What to do when Selenium encounters referer-based anti-scraping

        Recently I ran into a problem at work. A crawler I had written with requests was being used by colleagues on the operations team to export product data from a back-office account, so they no longer had to open each page and copy the data into a spreadsheet by hand. It cut their workload considerably.

        I had repeatedly stressed that the script must not be run too frequently, or the website would start rejecting the requests, but they are not programmers, after all, and they tossed my warnings to the back of their minds.

Fortunately, the website only rejected the scripted requests, and the pages could still be accessed with a browser.

But this created a new problem. When accessing the interface URL, the request headers must contain a referer field; if the correct referer is not included, the interface will not return the data I want.
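The check itself looks roughly like this (a sketch with made-up URLs; the real ones belong to the company back office):

import requests

api_url = 'https://example.com/api/product/123'   # hypothetical interface URL
page_url = 'https://example.com/product/123'      # hypothetical product page

requests.get(api_url)                                 # without a referer, the data is withheld
requests.get(api_url, headers={'Referer': page_url})  # with the right referer, the full data comes back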

As for why not just take the data from the page itself: the data shown on the page is incomplete, and some of the fields we need can only be obtained from the interface.

There are two solutions that came to mind:

  1. Access the interface directly with custom headers (setting the referer);
  2. Visit the product page, then capture the packet of the interface I want from Chrome's Network log

Solution 1: Custom headers

A method I found on the Internet; the code is roughly as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("no-sandbox")
chrome_options.add_argument("--disable-extensions")
driver = webdriver.Chrome(options=chrome_options)  # the chrome_options= keyword is deprecated in Selenium 4
# Note: header_overrides comes from the Selenium Wire package; a plain Selenium
# driver silently ignores this attribute
driver.header_overrides = {"referer": "referer to be set"}
driver.get("interface to access")

        After many attempts I found it made no difference; I still couldn't get the interface data, so I abandoned this plan. I don't know whether something is wrong with my code; if you find the problem, please let me know.
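Looking back, one likely reason is that header_overrides belongs to the Selenium Wire package rather than to plain Selenium, so assigning it to an ordinary driver does nothing. A minimal sketch of an alternative that stays within plain Selenium 4, using the Chrome DevTools Protocol to attach extra headers to every request (I have not verified it against this particular site; the URLs are placeholders):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

driver = webdriver.Chrome(options=Options())

driver.execute_cdp_cmd('Network.enable', {})  # the Network domain must be enabled first
driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {
    'headers': {'Referer': 'referer to be set'}  # placeholder, as above
})
driver.get('interface to access')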

Solution 2: Capture Network packets

This is also code I found on the Internet; after a little modification it became something I could use, as follows:

import json
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Enable performance log collection; set_capability replaces the desired_capabilities
# argument, which is deprecated and later removed in Selenium 4
chrome_options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver = webdriver.Chrome(options=chrome_options)  # start the browser
driver.get('Product page to visit')  # visit the product page (placeholder URL)


def filter_type(_type: str):
    """Return True for content types worth inspecting, i.e. anything that is not a static asset."""
    types = [
        'application/javascript', 'application/x-javascript', 'text/css', 'webp', 'image/png', 'image/gif',
        'image/jpeg', 'image/x-icon', 'application/octet-stream'
    ]
    return _type not in types

data = ''
while not data:  # End the loop when the desired data is obtained
    performance_log = driver.get_log('performance')  # Get the log named performance
    for packet in performance_log:
        message = json.loads(packet.get('message')).get('message')  # Get message data
        if message.get('method') != 'Network.responseReceived':  # skip everything except Network.responseReceived events
            continue
        packet_type = message.get('params').get('response').get('mimeType')  # Get the type returned by the request
        if not filter_type(_type=packet_type):  # filter type
            continue
        requestId = message.get('params').get('requestId')  # requestId uniquely identifies this request
        url = message.get('params').get('response').get('url')  # Get the url of the request
        if url != 'Interface to get data url':  # Use url to judge whether the data packet is the one we want
            continue
        try:
            resp = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': requestId})  # selenium calls cdp
            print(f'type: {packet_type} url: {url}')
            data = resp
            with open('data.json', 'w') as fp:
                fp.write(json.dumps(data))  # store data in file
            break
        except WebDriverException:  # the response body may no longer be available; ignore and keep polling
            pass

Since the target packet is loaded asynchronously by the page and its timing depends on the network, I added a while loop here that keeps polling the performance log until the desired packet is obtained.
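One refinement worth considering: get_log('performance') drains the log buffer on each call, so a bare while loop busy-polls and, if the packet never arrives, spins forever. A sketch of a politer loop with a sleep and a deadline (extract_packet is a hypothetical helper standing in for the parsing logic above):

import time

def poll_for_packet(driver, timeout: float = 30.0, interval: float = 0.5):
    """Poll the performance log until the target packet appears or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        for packet in driver.get_log('performance'):
            data = extract_packet(packet)  # hypothetical helper wrapping the parsing above
            if data:
                return data
        time.sleep(interval)  # let new log entries accumulate instead of busy-polling
    raise TimeoutError('target packet never appeared in the performance log')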

  The result came out after a single run. My verdict: very easy to use!

The saved data is exactly what I want, very nice!
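One detail about the saved file: Network.getResponseBody returns a dict with a body string and a base64Encoded flag, so data.json holds that wrapper rather than the bare payload. A small sketch of unpacking it afterwards (assuming the interface returns JSON):

import base64
import json

with open('data.json') as fp:
    resp = json.load(fp)

# Network.getResponseBody returns {'body': <str>, 'base64Encoded': <bool>}
if resp.get('base64Encoded'):
    body = base64.b64decode(resp['body']).decode('utf-8')
else:
    body = resp['body']

print(json.loads(body))  # assumes the interface responds with JSON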

Solution 2 reference source: [Selenium] Selenium gets Network data (advanced version)

Tags: Python crawler Selenium

Posted by Anti-Moronic on Sat, 18 Mar 2023 23:47:36 +1030