When collecting data with a Python crawler, a crucial operation is extracting data from the requested web page, and correctly locating the desired elements is the first step.
This article compares the common ways of locating web page elements in Python crawlers:
- Traditional BeautifulSoup operations
- CSS selectors based on BeautifulSoup (similar to PyQuery)
- XPath
- Regular expressions
The reference page is Dangdang's best seller list:
http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1
Let's take the titles of the 20 books on the first page as an example. First, determine whether the site directly returns the content to be parsed, without anti-scraping measures:
```python
import requests

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
print(response)
```
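As a quick sanity check before parsing, it can also help to inspect the status code and encoding; a minimal sketch (the browser-like User-Agent header here is an optional precaution, and its exact string is an arbitrary example):

```python
import requests

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
# A browser-like User-Agent is a common precaution; the exact string is arbitrary.
headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get(url, headers=headers)

print(resp.status_code)   # 200 means the request succeeded
print(resp.encoding)      # the encoding requests guessed from the response headers
print(len(resp.text))     # a rough check that the page body is non-trivial
```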
After careful inspection, the required data can be found in the returned content, indicating that no anti-scraping measures need to be specially considered.
Inspecting the page shows that the book entries are all contained in `<li>` elements inside a `<ul>` with class `bang_list clearfix bang_list_mode`.
Further inspection shows that each book's title sits in the `title` attribute of the `<a>` tag under the `<div class="name">` of its `<li>`, which is the key basis for most of the parsing methods below.
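To confirm this structure, a quick check (a hedged sketch using lxml, which the XPath example below also relies on) can count the `<li>` entries under that `<ul>`; with the structure above it should print 20:

```python
import requests
from lxml import html

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

tree = html.fromstring(response)
# Count the <li> children of the book-list <ul>; 20 entries are expected per page.
items = tree.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
print(len(items))
```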
1. Traditional BeautifulSoup operations
The classic BeautifulSoup approach imports the class with `from bs4 import BeautifulSoup`, converts the text into a standard parse tree via `soup = BeautifulSoup(html, "lxml")`, and then extracts data with the `find` family of methods. The code is as follows:
```python
import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def bs_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li')  # lock onto the ul, then get the 20 li
    for li in li_list:
        title = li.find('div', class_='name').find('a')['title']  # parse each book title one by one
        print(title)

if __name__ == '__main__':
    bs_for_parse(response)
```
This successfully obtains the 20 book titles. Some of them are lengthy and could be cleaned up with regular expressions or other string methods, which this article will not cover in detail.
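As one hedged illustration of such cleanup (the bracket delimiter chosen here is an assumption about how subtitles are formatted; adjust it to the actual titles), a long title could be truncated with plain string methods:

```python
def clean_title(title: str, max_len: int = 40) -> str:
    """Illustrative cleanup: cut at the first fullwidth bracket and cap the length.
    The '（' delimiter is an assumption about how subtitles are formatted."""
    title = title.split('（')[0].strip()
    return title if len(title) <= max_len else title[:max_len] + '...'

print(clean_title('Example title（with a long subtitle that we do not need）'))
```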
2. CSS selectors based on BeautifulSoup
This method essentially migrates the CSS selector usage of PyQuery into another module, and the usage is similar. For detailed CSS selector syntax, refer to: http://www.w3school.com.cn/cssref/css_selectors.asp. Since it is based on BeautifulSoup, the imports and text-to-tree conversion are the same:
```python
import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    print(soup)

if __name__ == '__main__':
    css_for_parse(response)
```
Specific content is then obtained through `soup.select` combined with the appropriate CSS syntax, which still relies on careful inspection and analysis of the page elements:
```python
import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')
    for li in li_list:
        title = li.select('div.name > a')[0]['title']
        print(title)

if __name__ == '__main__':
    css_for_parse(response)
```
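As a usage note, BeautifulSoup also provides `select_one`, which returns the first match directly (or `None`) instead of a list, avoiding the `[0]` indexing; a sketch of the same function written that way:

```python
import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def css_select_one_parse(response):
    soup = BeautifulSoup(response, "lxml")
    for li in soup.select('ul.bang_list.clearfix.bang_list_mode > li'):
        # select_one returns the first match or None, instead of a list.
        a_tag = li.select_one('div.name > a')
        if a_tag is not None:
            print(a_tag['title'])

if __name__ == '__main__':
    css_select_one_parse(response)
```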
3. XPath
XPath is the XML Path Language, a language for addressing parts of an XML document. If you use the Chrome browser, installing the XPath Helper plug-in is recommended; it greatly improves the efficiency of writing XPath expressions.
The previous crawler articles in this series were mostly based on XPath, so it should already be relatively familiar; the code is given directly:
```python
import requests
from lxml import html

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def xpath_for_parse(response):
    selector = html.fromstring(response)
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for book in books:
        title = book.xpath('div[@class="name"]/a/@title')[0]
        print(title)

if __name__ == '__main__':
    xpath_for_parse(response)
```
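As a hedged variant, the two-step lookup can also be collapsed into a single XPath expression that pulls all the `title` attributes at once:

```python
import requests
from lxml import html

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def xpath_one_shot(response):
    selector = html.fromstring(response)
    # One expression: every title attribute of the <a> under div.name in the list.
    titles = selector.xpath(
        "//ul[@class='bang_list clearfix bang_list_mode']/li"
        "/div[@class='name']/a/@title"
    )
    for title in titles:
        print(title)

if __name__ == '__main__':
    xpath_one_shot(response)
```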
4. Regular expressions
If you are not familiar with HTML, the previous parsing methods can be difficult, so here is a universal alternative: regular expressions. You only need to look at the textual patterns surrounding the target content itself, and matching with specific rules yields the corresponding data. The required module is `re`.
First, observe what is distinctive before and after the required text in the directly returned content:
```python
import requests
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
print(response)
```
After observing a few entries, the pattern becomes clear: `<div class="name"><a href="http://product.dangdang.com/xxxxxxxx.html" target="_blank" title="XXXXXX">`. The book title is hidden in this string, and the number at the end of the URL changes with each book.
After analysis, the regular expression can be written:
```python
import requests
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def re_for_parse(response):
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">'
    for title in re.findall(reg, response):
        print(title)

if __name__ == '__main__':
    re_for_parse(response)
```
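As a usage note, when the same pattern is applied repeatedly (for example, across many pages of the list), precompiling it with `re.compile` is a common idiom; a minimal sketch:

```python
import re

# Precompile once and reuse across pages; the pattern matches the structure above.
TITLE_RE = re.compile(
    r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html"'
    r' target="_blank" title="(.*?)">'
)

def titles_from(page_html: str) -> list[str]:
    return TITLE_RE.findall(page_html)
```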
It turns out that the regex version is the most concise to write, but it requires real proficiency with regular expression syntax. As they say, the regex approach is powerful!
Of course, every method has the scenarios where it fits best; in real work you still need to analyze the page structure to decide how to locate elements efficiently. Finally, the complete code for the four methods introduced in this article is attached below; run it yourself to deepen your understanding.
```python
import requests
from bs4 import BeautifulSoup
from lxml import html
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def bs_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li')
    for li in li_list:
        title = li.find('div', class_='name').find('a')['title']
        print(title)

def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')
    for li in li_list:
        title = li.select('div.name > a')[0]['title']
        print(title)

def xpath_for_parse(response):
    selector = html.fromstring(response)
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for book in books:
        title = book.xpath('div[@class="name"]/a/@title')[0]
        print(title)

def re_for_parse(response):
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">'
    for title in re.findall(reg, response):
        print(title)

if __name__ == '__main__':
    # bs_for_parse(response)
    # css_for_parse(response)
    # xpath_for_parse(response)
    re_for_parse(response)
```
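If you want a rough feel for their relative speed, a hedged timing sketch (assuming it is appended to the complete script above, so the four functions and `response` are in scope; results depend on your machine and the page):

```python
import time

def timed(func, response):
    """Run one parser and report its elapsed wall-clock time."""
    start = time.perf_counter()
    func(response)
    print(f'{func.__name__}: {time.perf_counter() - start:.3f}s')

for parser in (bs_for_parse, css_for_parse, xpath_for_parse, re_for_parse):
    timed(parser, response)
```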