Comparing four common methods Python crawlers use to locate elements: which one do you prefer?

When collecting data with a Python crawler, one of the most important operations is extracting data from the requested web page, and correctly locating the desired elements is the first step.

This article compares the common ways of locating web page elements in Python crawlers, for your reference:

 
  1. Traditional BeautifulSoup operation

  2. CSS selectors based on BeautifulSoup (similar to PyQuery)

  3. XPath

  4. Regular expressions

The reference page is Dangdang's bestseller list:

http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1

Let's take the titles of the 20 books on the first page as an example. First, determine whether the site directly returns the content to be parsed, without anti-crawling measures in the way:

import requests

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
print(response)

Careful inspection shows that the required data is all present in the returned content, so no special anti-crawling countermeasures are needed.
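
Before writing any parser, a quick sanity check can confirm both the status code and the presence of the target markup; a minimal sketch (the 'bang_list' substring is taken from the list's class name):

import requests

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
resp = requests.get(url)
print(resp.status_code)          # 200 means the request succeeded
print('bang_list' in resp.text)  # True if the bestseller list markup came back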

Inspecting the page shows that the book information lives in li elements under the ul with class bang_list clearfix bang_list_mode.

Closer inspection also shows that each book's title sits in a fixed position within its entry, which is the key fact all of the parsing methods below rely on.

1. Traditional BeautifulSoup operation

The classic BeautifulSoup approach: import the class with from bs4 import BeautifulSoup, convert the text into a parse tree with soup = BeautifulSoup(html, "lxml"), and then extract data with the find family of methods. The code is as follows:

import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def bs_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li') # locate the ul, then collect its 20 li children
    for li in li_list:
        title = li.find('div', class_='name').find('a')['title'] # extract each book's title from the name div
        print(title)

if __name__ == '__main__':
    bs_for_parse(response)

This successfully obtains the 20 book titles. Some of them look lengthy and could be cleaned up with regular expressions or string methods; this article will not cover that in detail.
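
For instance, a minimal cleanup sketch that trims everything from the first parenthesis onward (a naive, hypothetical heuristic; real titles may need more care):

def clean_title(title):
    # Cut the title at the first parenthesis, full-width or ASCII
    for sep in ('（', '('):
        title = title.split(sep)[0]
    return title.strip()

print(clean_title('Some Book (with a very long subtitle)'))  # Some Book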

2. CSS selectors based on BeautifulSoup

This method essentially ports the CSS selector usage familiar from PyQuery to other modules, and the usage is similar. For the detailed syntax of CSS selectors, see: http://www.w3school.com.cn/cssref/css_selectors.asp. Because this approach is based on BeautifulSoup, the imports and the text-to-tree conversion are the same as before:
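
Before applying it to the live page, here is a minimal, self-contained sketch of the most common selector forms (tag, class, id, and child chain) on a toy HTML string; all names here are illustrative:

from bs4 import BeautifulSoup

toy = '<ul class="books"><li id="first"><a title="T1">A</a></li><li><a title="T2">B</a></li></ul>'
soup = BeautifulSoup(toy, "lxml")

print(soup.select('li'))                 # by tag: both li elements
print(soup.select('ul.books'))           # by class
print(soup.select('#first'))             # by id
print(soup.select('ul.books > li > a'))  # direct-child chain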

import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
        
def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml") 
    print(soup)

if __name__ == '__main__':
    css_for_parse(response)

Then soup.select(), combined with specific CSS selector syntax, retrieves the target content; this still rests on careful inspection and analysis of the elements:

import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
        
def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')  # dots chain the ul's three classes
    for li in li_list:
        title = li.select('div.name > a')[0]['title']  # select returns a list; take the first match
        print(title)

if __name__ == '__main__':
    css_for_parse(response)

3. XPath

XPath is the XML Path Language, a query language for selecting nodes from an XML document. If you use the Chrome browser, the XPath Helper extension is recommended, as it greatly speeds up writing XPath expressions.
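
As a quick refresher, a few common XPath forms on a toy document (all names here are illustrative):

from lxml import html

doc = html.fromstring('<ul class="b"><li><a title="T1">A</a></li><li><a title="T2">B</a></li></ul>')

print(doc.xpath('//li'))                  # all li nodes, anywhere in the tree
print(doc.xpath('//ul[@class="b"]/li'))   # li children of a ul with class "b"
print(doc.xpath('//a/@title'))            # attribute values: ['T1', 'T2']
print(doc.xpath('//a/text()'))            # text content: ['A', 'B']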

Earlier crawler articles in this series were mostly based on XPath, so it should already be familiar; the code is given directly:

import requests
from lxml import html

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def xpath_for_parse(response):
    selector = html.fromstring(response)  # build an lxml HTML tree from the text
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")  # the 20 li entries
    for book in books:
        title = book.xpath('div[@class="name"]/a/@title')[0]  # @title yields the attribute value
        print(title)

if __name__ == '__main__':
    xpath_for_parse(response)

4. Regular expressions

If you are not familiar with HTML, the previous parsing methods can be difficult. Here is a universal alternative: regular expressions. You only need to look at the textual pattern around the target itself, and content matching a specific rule can then be extracted. The module required is re.
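
If re is unfamiliar, here is a minimal sketch of the key idea used below: re.findall with a capture group returns only the text inside the parentheses:

import re

sample = 'title="Book One" ... title="Book Two"'
print(re.findall(r'title="(.*?)"', sample))  # ['Book One', 'Book Two']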

First, observe what is distinctive immediately before and after the required text in the returned content:

import requests
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
print(response)

After looking at a few entries, the answer is clear: <div class="name"><a href="http://product.dangdang.com/xxxxxxxx.html" target="_blank" title="XXXXXX">. The book title is hidden in this string, and the number at the end of the URL changes from book to book.

After analysis, the regular expression can be written:

import requests
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def re_for_parse(response):
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">'
    for title in re.findall(reg, response):
        print(title)

if __name__ == '__main__':
    re_for_parse(response)

You can see that the regex version is the most concise to write, but it demands real proficiency with regular expressions. As the saying goes, regex is almighty!
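
If the same pattern is reused across many pages, compiling it once avoids re-parsing the pattern each time; a minimal sketch:

import re

# Compile once, reuse everywhere; note the escaped dots and the raw string
TITLE_RE = re.compile(
    r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" '
    r'target="_blank" title="(.*?)">'
)

def re_for_parse_compiled(response):
    for title in TITLE_RE.findall(response):
        print(title)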

Of course, every method has scenarios it suits best. In practice, you still need to analyze the page structure to decide how to locate elements most efficiently. Finally, the complete code for the four methods introduced in this article is attached below; run it yourself to deepen your understanding.

import requests
from bs4 import BeautifulSoup
from lxml import html
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def bs_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li')
    for li in li_list:
        title = li.find('div', class_='name').find('a')['title']
        print(title)

def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')
    for li in li_list:
        title = li.select('div.name > a')[0]['title']
        print(title)

def xpath_for_parse(response):
    selector = html.fromstring(response)
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for book in books:
        title = book.xpath('div[@class="name"]/a/@title')[0]
        print(title)

def re_for_parse(response):
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">'
    for title in re.findall(reg, response):
        print(title)

if __name__ == '__main__':
    # bs_for_parse(response)
    # css_for_parse(response)
    # xpath_for_parse(response)
    re_for_parse(response)
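
To get a rough feel for relative speed, one option is to replace the __main__ block above with a simple timing loop; a minimal sketch using time.perf_counter (numbers vary by machine, and printing dominates the runtime):

if __name__ == '__main__':
    import time
    # Time each parser on the same, already-fetched response
    for name, func in [('bs4 find', bs_for_parse), ('bs4 select', css_for_parse),
                       ('XPath', xpath_for_parse), ('regex', re_for_parse)]:
        start = time.perf_counter()
        func(response)  # prints the 20 titles as a side effect
        print(f'{name}: {time.perf_counter() - start:.4f}s')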
