Use of Pyppeteer -- crawling JD.com (Jingdong)

1. Pyppeteer advantages

  • No separate browser driver to configure, unlike Selenium
  • You scrape the rendered page: what you extract is not the raw page source, but the page as it is actually loaded and displayed in the browser
  • Some anti-crawling detection can be bypassed (e.g. with pyppeteer_stealth)

What pyppeteer sees is the HTML after the page has finished loading, so all of the JavaScript-rendered data is present. Pyppeteer obtains the loaded web page data:

requests, by contrast, only sees the raw page source. There may be JS calls or Ajax interfaces inside, so the returned code is incomplete:
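To illustrate the difference without touching the network, here is a toy sketch. The markup is invented for the example (only the `J_goodsList` container id comes from the article): the static source has an empty container that a script would fill in, while the rendered page holds the actual data.

```python
import re

# Hypothetical static source: the product container exists but is empty,
# because JavaScript fills it in after the page loads (what requests sees).
STATIC_SOURCE = '<div id="J_goodsList"></div><script src="list.js"></script>'
# Hypothetical rendered page: the container holds data (what pyppeteer sees).
RENDERED_PAGE = '<div id="J_goodsList"><ul><li>headset</li></ul></div>'

def has_product_data(page_html: str) -> bool:
    """Crude check: does the product container hold any content?"""
    m = re.search(r'<div id="J_goodsList">(.*?)</div>', page_html, re.S)
    return bool(m and m.group(1).strip())
```

Running `has_product_data` on each string shows why scraping the raw source misses the products while scraping the rendered page finds them.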

2. Crawl JD data

Installation: pip install pyppeteer pyppeteer_stealth
Crawling approach:
1. Observe the page
2. Enter "headset" in the search box and click search
3. Grab the data of the first three pages
4. Implement page turning

Complete code:

import asyncio  # Coroutines
from pyppeteer import launch
from pyppeteer_stealth import stealth

width, height = 1366, 768

async def main():
    browser = await launch(headless=False)
    page = await browser.newPage()  # New tab
    await stealth(page)  # Eliminate fingerprints
    # Set window size
    await page.setViewport({'width': width, 'height': height})
    await page.goto('https://www.jd.com/?cu=true&utm_source=baidu-pinzhuan&utm_medium=cpc&utm_campaign=t_288551095_baidupinzhuan&utm_term=0f3d30c8dba7459bb52f2eb5eba8ac7d_0_c40b68367c9e42489ad40ec69c3a693a')  # go to the target page
    # Wait for the page to load completely
    await asyncio.sleep(2)
    # Locate the search box. The search box id=key
    await page.waitForSelector('#key', {'timeout': 9000})
    # Fill in the search box
    await page.type('#key', 'headset')
    await asyncio.sleep(1)  # Wait for 1s
    # Click the search button, whose class is "button"
    await page.click('.button')
    await asyncio.sleep(2)
    # Page turning
    for num in range(3):
        # Scroll the browser scroll bar to the bottom
        await page.evaluate('window.scrollBy(100, document.body.scrollHeight)')
        await asyncio.sleep(1)
        # Slide twice to extract more data
        await page.evaluate('window.scrollBy(200, document.body.scrollHeight)')
        # //*[@ id="J_goodsList"]/ul/li[1] product Xpath
        li_list = await page.xpath('//*[@id="J_goodsList"]/ul/li')
        for i in li_list:
            # Title element: ./div/div[4]/a/em relative to each <li>
            a = await i.xpath('./div/div[4]/a/em')
            if a:
                # Extract the title text
                title = await (await a[0].getProperty("textContent")).jsonValue()
                print(title)
        # Click the next-page button, whose class is "pn-next"
        await page.click('.pn-next')
        await asyncio.sleep(1)
    await asyncio.sleep(100)  # Keep the browser open so the result can be inspected

# Execute asynchronously
asyncio.get_event_loop().run_until_complete(main())
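To see what the XPath extraction step is doing, here is an offline sketch using the standard library's ElementTree. The sample markup is invented and heavily simplified (the real JD page is far larger), but the one-`<li>`-per-product nesting mirrors the XPath used in the crawler.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified result markup for illustration only.
SAMPLE = """\
<div id="J_goodsList">
  <ul>
    <li><div><a><em>Wireless headset A</em></a></div></li>
    <li><div><a><em>Gaming headset B</em></a></div></li>
  </ul>
</div>
"""

def extract_titles(markup: str) -> list:
    root = ET.fromstring(markup)
    # One <li> per product; the title text sits inside an <em>,
    # just as in the pyppeteer version above.
    return [em.text for em in root.findall(".//ul/li/div/a/em")]
```

The pyppeteer version does the same walk, except the handles it returns live in the browser, so each property read (`getProperty`/`jsonValue`) is an extra round trip.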

Code explanation:
1. browser = await launch(headless=False) opens the browser
headless=True -- headless mode; you cannot watch the browser as it crawls
headless=False -- visible mode; you can watch the browser work
2. page = await browser.newPage() opens a new tab
3. await stealth(page) hides automation fingerprints

Browser fingerprint:
Your browser has a very unique fingerprint. When you visit a website for the first time, the website will record your browser fingerprint on the server and your behavior on the website;
The next time you visit, the website server reads the fingerprint of the browser again, and then compares it with the fingerprint stored before to know whether you have been here and what you did during your last visit.
Browser fingerprint threat:
A browser fingerprint does not require saving any information on the client, is not noticed by the user, and cannot be cleared by the user (in other words, you cannot even tell whether the website you visit collects browser fingerprints), so you can be tracked.
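To show the principle, here is a toy sketch. The attribute set and hashing scheme are invented for the example; real fingerprinting combines far more signals (canvas rendering, fonts, WebGL, plugins, and so on).

```python
import hashlib

def browser_fingerprint(attrs: dict) -> str:
    """Toy illustration: hash a few browser attributes into a stable ID.

    Nothing is stored on the client; the same attributes always
    produce the same ID, so the server can recognise a returning
    browser just by recomputing the hash.
    """
    material = "|".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    return hashlib.sha256(material.encode()).hexdigest()[:16]

# Hypothetical visitor attributes for the example.
visitor = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "screen": "1366x768",
    "language": "zh-CN",
    "timezone": "UTC+8",
}
```

Changing any one attribute (a different screen size, another language) yields a different ID, which is why stealth plugins patch the properties automation tools expose.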

4. await page.setViewport({'width': width, 'height': height}) sets the browser window (viewport) size
5. await page.goto('url') goes to the target page
6.await asyncio.sleep(2) pauses for two seconds and waits for the page to load
7.await page.waitForSelector('#key', {'timeout': 9000}) locates the target element
Parameter 1: #name locates the element with id=name; .name locates elements with class=name
Parameter 2: {'timeout': 9000} -- a dict; wait up to 9000 ms (9 s) for the element to appear
8.await page.type('#key', 'headset') after locating the element, fill in the character in the element
9. await page.click('.button') click event -- clicks the element whose class is button
10. await page.evaluate('window.scrollBy(100, document.body.scrollHeight)') page scrolling event; adjust the numbers to control how far the page scrolls
11. await page.xpath('//*[@id="J_goodsList"]/ul/li') locates the product nodes; it returns a list of ElementHandle objects (not JSON), as follows:

[<pyppeteer.element_handle.ElementHandle object at 0x0000025D69BB9460>]

12. await (await a[0].getProperty("textContent")).jsonValue() extracts the text content of an element
13. asyncio.get_event_loop().run_until_complete(main()) runs the coroutine to completion (executes asynchronously)
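The execution step can be tried in isolation with a stand-in coroutine. The coroutine body here is invented for the example; note that `asyncio.get_event_loop()` is the older spelling, and newer Python prefers `asyncio.run(main())` (creating a loop explicitly, as below, works on all versions).

```python
import asyncio

async def main():
    # Stand-in for the crawler coroutine: anything awaited here runs
    # inside the event loop driven by run_until_complete().
    await asyncio.sleep(0)
    return "done"

# Equivalent to the article's asyncio.get_event_loop().run_until_complete(main())
loop = asyncio.new_event_loop()
result = loop.run_until_complete(main())
loop.close()
```

run_until_complete blocks until the coroutine finishes and returns its result, which is why the whole crawler lives inside one `async def main()`.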

Tags: crawler

Posted by snapbackz on Thu, 14 Apr 2022 02:26:28 +0930