Recently I ran into a problem at work. I had written a crawler with requests so that colleagues on the operations team could export product data from a back-office account, sparing them from copying data into a spreadsheet page by page and cutting down their workload.
I repeatedly stressed that the script should not be run too often, or the website would start rejecting our requests. But they are not programmers, after all, and my warnings were soon forgotten.
Fortunately, the website only rejected the scripted requests; the pages could still be opened normally in a browser.
But this created a new problem. When I accessed the interface URL with requests, the headers had to contain a referer field; without the correct referer, the interface would not return the data I wanted.
As for why I didn't just scrape the data from the page: the data shown on the page is incomplete, and some of the data we need can only be obtained from the interface.
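To illustrate the idea, here is a minimal sketch of attaching a referer header to a request with the standard library (the same idea as passing `headers=` to `requests.get`). Both URLs below are placeholders, not the real interface or page:

```python
import urllib.request

# Build a request carrying a custom Referer header.
# Both URLs are hypothetical placeholders.
req = urllib.request.Request(
    'https://example.com/api/products',                       # hypothetical interface URL
    headers={'Referer': 'https://example.com/product/123'},   # hypothetical page URL
)
print(req.get_header('Referer'))  # the referer that would be sent
```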
Two solutions came to mind:
- Access the interface with custom headers;
- Visit the product page, then grab the interface's response from the Network panel in Chrome.
Solution 1: Custom headers
I found a method online; the code is roughly as follows:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-extensions')

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.header_overrides = {"referer": "referer to be set"}
driver.get("interface to access")
```
After many attempts I found it didn't work; I still couldn't get the interface's data, so I abandoned this approach. I don't know whether something is wrong with my code; if you spot it, please let me know.
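As far as I know, `header_overrides` is a feature of the third-party selenium-wire package, not of plain Selenium, which may explain why the snippet above had no effect. Here is a sketch of how the same header rewrite could look with selenium-wire's request interceptor; the URL is a placeholder and the wiring assumes `pip install selenium-wire` plus a working Chrome driver:

```python
def set_referer(request):
    """Force a referer header onto every outgoing request (selenium-wire style)."""
    if 'referer' in request.headers:
        del request.headers['referer']           # drop any existing referer
    request.headers['referer'] = 'https://example.com/product-page'  # placeholder URL

# Wiring it up (requires selenium-wire and a Chrome driver):
#
#   from seleniumwire import webdriver
#   driver = webdriver.Chrome()
#   driver.request_interceptor = set_referer
#   driver.get('interface to access')
```

The interceptor runs on each request before it leaves the browser, so the server sees the rewritten referer.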
Solution 2: Capture Network packets
This is also code I found online; after a few tweaks it did what I needed, as follows:

```python
import json

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.options import Options

caps = {
    "browserName": "chrome",
    'goog:loggingPrefs': {'performance': 'ALL'}  # enable performance log collection
}
chrome_options = Options()
driver = webdriver.Chrome(desired_capabilities=caps, options=chrome_options)  # start the browser
driver.get('Product page to visit')  # visit the url


def filter_type(_type: str):
    types = [
        'application/javascript', 'application/x-javascript', 'text/css', 'webp',
        'image/png', 'image/gif', 'image/jpeg', 'image/x-icon',
        'application/octet-stream'
    ]
    if _type not in types:
        return True
    return False


data = ''
while not data:  # end the loop once the desired data is obtained
    performance_log = driver.get_log('performance')  # get the log named 'performance'
    for packet in performance_log:
        message = json.loads(packet.get('message')).get('message')  # extract the message data
        if message.get('method') != 'Network.responseReceived':  # skip events that are not responses
            continue
        packet_type = message.get('params').get('response').get('mimeType')  # MIME type of the response
        if not filter_type(_type=packet_type):  # filter out static-resource types
            continue
        requestId = message.get('params').get('requestId')  # unique identifier of the request
        url = message.get('params').get('response').get('url')  # URL of the request
        if url != 'Interface to get data url':  # match on the URL to find the packet we want
            continue
        try:
            # selenium calls CDP to fetch the response body
            resp = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': requestId})
            print(f'type: {packet_type} url: {url}')
            data = resp
            with open('data.json', 'w') as fp:
                fp.write(json.dumps(data))  # store the data in a file
            break
        except WebDriverException:  # the body may not be available yet; ignore and retry
            pass
```
Since the packets are loaded asynchronously by the page and the loading time depends on the network, I added a while loop that keeps polling until the desired packet arrives.
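The unwrapping the loop performs can be seen in isolation on a hand-built log entry. The entry below is hypothetical (all values are made up for illustration), but it has the same shape as the items `driver.get_log('performance')` returns, where each item carries a JSON string under `'message'`:

```python
import json

# Hypothetical entry, shaped like one item from driver.get_log('performance').
sample_packet = {
    'message': json.dumps({
        'message': {
            'method': 'Network.responseReceived',
            'params': {
                'requestId': '1000.1',
                'response': {
                    'url': 'https://example.com/api/products',  # placeholder
                    'mimeType': 'application/json',
                },
            },
        }
    })
}

# The same unwrapping the polling loop performs:
message = json.loads(sample_packet['message'])['message']
if message['method'] == 'Network.responseReceived':
    request_id = message['params']['requestId']
    url = message['params']['response']['url']
    mime_type = message['params']['response']['mimeType']
    print(request_id, url, mime_type)
```

Once the `requestId` is in hand, it is the key passed to `Network.getResponseBody` to retrieve the actual payload.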
It worked on the very first run. My verdict: very easy to use!
The saved data was exactly what I wanted, very nice!
Solution 2 reference source: [Selenium] Selenium gets Network data (advanced version)