Use of Pyppeteer -- crawling JD.com (Jingdong)

1. Pyppeteer advantages

  • No separate browser driver to configure, unlike Selenium
  • You scrape the rendered page: what you extract is not the raw page source, but the page as it is actually loaded and displayed in the browser
  • Some anti-crawling detection can be bypassed (e.g. with pyppeteer_stealth)

What pyppeteer sees is the HTML after the page has finished loading, so all of the JavaScript-rendered data is present. Pyppeteer obtains the loaded web page data:

requests, by contrast, only sees the raw page source. There may be JS calls or Ajax interfaces inside, so the returned code is incomplete:
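To illustrate the difference without touching the network, here is a toy sketch. The markup is invented for the example (only the `J_goodsList` container id comes from the article): the static source has an empty container that a script would fill in, while the rendered page holds the actual data.

```python
import re

# Hypothetical static source: the product container exists but is empty,
# because JavaScript fills it in after the page loads (what requests sees).
STATIC_SOURCE = '<div id="J_goodsList"></div><script src="list.js"></script>'
# Hypothetical rendered page: the container holds data (what pyppeteer sees).
RENDERED_PAGE = '<div id="J_goodsList"><ul><li>headset</li></ul></div>'

def has_product_data(page_html: str) -> bool:
    """Crude check: does the product container hold any content?"""
    m = re.search(r'<div id="J_goodsList">(.*?)</div>', page_html, re.S)
    return bool(m and m.group(1).strip())
```

Running `has_product_data` on each string shows why scraping the raw source misses the products while scraping the rendered page finds them.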

2. Crawl JD data

Installation: pip install pyppeteer pyppeteer_stealth
Crawling approach:
1. Observe the page
2. Enter "headset" in the search box and click search
3. Grab the data of the first three pages
4. Implement page turning

Complete code:

import asyncio  # Coroutines
from pyppeteer import launch
from pyppeteer_stealth import stealth

width, height = 1366, 768

async def main():
    browser = await launch(headless=False)
    page = await browser.newPage()  # New tab
    await stealth(page)  # Eliminate fingerprints
    # Set window size
    await page.setViewport({'width': width, 'height': height})
    await page.goto('https://www.jd.com/?cu=true&utm_source=baidu-pinzhuan&utm_medium=cpc&utm_campaign=t_288551095_baidupinzhuan&utm_term=0f3d30c8dba7459bb52f2eb5eba8ac7d_0_c40b68367c9e42489ad40ec69c3a693a')  # go to the target page
    # Wait for the page to load completely
    await asyncio.sleep(2)
    # Locate the search box. The search box id=key
    await page.waitForSelector('#key', {'timeout': 9000})
    # Fill in the search box
    await page.type('#key', 'headset')
    await asyncio.sleep(1)  # Wait for 1s
    # Click the search button, whose class is "button"
    await page.click('.button')
    await asyncio.sleep(2)
    # Page turning
    for num in range(3):
        # Scroll the browser scroll bar to the bottom
        await page.evaluate('window.scrollBy(100, document.body.scrollHeight)')
        await asyncio.sleep(1)
        # Slide twice to extract more data
        await page.evaluate('window.scrollBy(200, document.body.scrollHeight)')
        # //*[@ id="J_goodsList"]/ul/li[1] product Xpath
        li_list = await page.xpath('//*[@id="J_goodsList"]/ul/li')
        for i in li_list:
            # Title element: ./div/div[4]/a/em relative to each <li>
            a = await i.xpath('./div/div[4]/a/em')
            if a:
                # Extract the title text
                title = await (await a[0].getProperty("textContent")).jsonValue()
                print(title)
        # Click the next-page button, whose class is "pn-next"
        await page.click('.pn-next')
        await asyncio.sleep(1)
    await asyncio.sleep(100)  # Keep the browser open so the result can be inspected

# Execute asynchronously
asyncio.get_event_loop().run_until_complete(main())
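To see what the XPath extraction step is doing, here is an offline sketch using the standard library's ElementTree. The sample markup is invented and heavily simplified (the real JD page is far larger), but the one-`<li>`-per-product nesting mirrors the XPath used in the crawler.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified result markup for illustration only.
SAMPLE = """\
<div id="J_goodsList">
  <ul>
    <li><div><a><em>Wireless headset A</em></a></div></li>
    <li><div><a><em>Gaming headset B</em></a></div></li>
  </ul>
</div>
"""

def extract_titles(markup: str) -> list:
    root = ET.fromstring(markup)
    # One <li> per product; the title text sits inside an <em>,
    # just as in the pyppeteer version above.
    return [em.text for em in root.findall(".//ul/li/div/a/em")]
```

The pyppeteer version does the same walk, except the handles it returns live in the browser, so each property read (`getProperty`/`jsonValue`) is an extra round trip.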

Code explanation:
1. browser = await launch(headless=False) opens the browser
headless=True -- headless mode; you cannot watch the browser as it crawls
headless=False -- visible mode; you can watch the browser work
2. page = await browser.newPage() opens a new tab
3. await stealth(page) hides automation fingerprints

Browser fingerprint:
Your browser has a very unique fingerprint. When you visit a website for the first time, the website will record your browser fingerprint on the server and your behavior on the website;
The next time you visit, the website server reads the fingerprint of the browser again, and then compares it with the fingerprint stored before to know whether you have been here and what you did during your last visit.
Browser fingerprint threat:
A browser fingerprint does not require saving any information on the client, is not noticed by the user, and cannot be cleared by the user (in other words, you cannot even tell whether the website you visit collects browser fingerprints), so you can be tracked.
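To show the principle, here is a toy sketch. The attribute set and hashing scheme are invented for the example; real fingerprinting combines far more signals (canvas rendering, fonts, WebGL, plugins, and so on).

```python
import hashlib

def browser_fingerprint(attrs: dict) -> str:
    """Toy illustration: hash a few browser attributes into a stable ID.

    Nothing is stored on the client; the same attributes always
    produce the same ID, so the server can recognise a returning
    browser just by recomputing the hash.
    """
    material = "|".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    return hashlib.sha256(material.encode()).hexdigest()[:16]

# Hypothetical visitor attributes for the example.
visitor = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "screen": "1366x768",
    "language": "zh-CN",
    "timezone": "UTC+8",
}
```

Changing any one attribute (a different screen size, another language) yields a different ID, which is why stealth plugins patch the properties automation tools expose.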

4. await page.setViewport({'width': width, 'height': height}) sets the browser window (viewport) size
5. await page.goto('url') goes to the target page
6.await asyncio.sleep(2) pauses for two seconds and waits for the page to load
7.await page.waitForSelector('#key', {'timeout': 9000}) locates the target element
Parameter 1: #name locates the element with id=name; .name locates elements with class=name
Parameter 2: {'timeout': 9000} -- a dict; wait up to 9000 ms (9 s) for the element to appear
8.await page.type('#key', 'headset') after locating the element, fill in the character in the element
9. await page.click('.button') click event -- clicks the element whose class is button
10. await page.evaluate('window.scrollBy(100, document.body.scrollHeight)') page scrolling event; adjust the numbers to control how far the page scrolls
11. await page.xpath('//*[@id="J_goodsList"]/ul/li') locates the product nodes; it returns a list of ElementHandle objects (not JSON), as follows:

[<pyppeteer.element_handle.ElementHandle object at 0x0000025D69BB9460>]

12. await (await a[0].getProperty("textContent")).jsonValue() extracts the text content of an element
13. asyncio.get_event_loop().run_until_complete(main()) runs the coroutine to completion (executes asynchronously)
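The execution step can be tried in isolation with a stand-in coroutine. The coroutine body here is invented for the example; note that `asyncio.get_event_loop()` is the older spelling, and newer Python prefers `asyncio.run(main())` (creating a loop explicitly, as below, works on all versions).

```python
import asyncio

async def main():
    # Stand-in for the crawler coroutine: anything awaited here runs
    # inside the event loop driven by run_until_complete().
    await asyncio.sleep(0)
    return "done"

# Equivalent to the article's asyncio.get_event_loop().run_until_complete(main())
loop = asyncio.new_event_loop()
result = loop.run_until_complete(main())
loop.close()
```

run_until_complete blocks until the coroutine finishes and returns its result, which is why the whole crawler lives inside one `async def main()`.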

Tags: crawler

Posted by snapbackz on Thu, 14 Apr 2022 02:26:28 +0930