Crawler data extraction with XPath and lxml: essentials for beginners

1. What is XPath, and what is it for?

XPath (XML Path Language) is a language for finding information in XML and HTML documents. It can be used to traverse the elements and attributes of a document and to locate a specific part of it.

At present, the major browsers have XPath extensions available:

  1. Chrome plugin XPath Helper.
  2. Firefox plugin Try XPath.

Installing the plug-in

Chrome extensions normally have to be downloaded from behind the firewall, so here is an alternative installation method, shown below.

1. First download the XPath Helper plug-in from the link below.
Extraction code: a1dv
2. After downloading the plug-in, unzip it, then find the 2.0.2_0.crx file in the extracted folder and change its suffix from .crx to .rar, as shown below.

3. After decompressing it, open the Chrome extensions page, enable developer mode, and load the unpacked extension, as shown below.

4. Usage

Restart the browser and click the XPath Helper icon in the upper right corner, as shown in the figure below.


1. Selecting nodes:

XPath uses path expressions to select nodes or node sets in an XML document. These path expressions are very similar to the ones used in a conventional computer file system.
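As a quick illustration of the two basic forms (`/` selects from the root, `//` selects anywhere in the document), here is a small sketch using lxml, which is introduced later in this article, on a tiny made-up XML snippet:

```python
from lxml import etree

# A tiny sample document (hypothetical, for illustration only)
xml = "<bookstore><book><title>A</title></book><book><title>B</title></book></bookstore>"
root = etree.fromstring(xml)

# "/" selects along an absolute path from the root;
# "//" selects matching nodes anywhere in the document
titles_abs = root.xpath("/bookstore/book/title/text()")
titles_any = root.xpath("//title/text()")
print(titles_abs)  # ['A', 'B']
print(titles_any)  # ['A', 'B']
```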

2. Predicates:

A predicate is used to find a specific node, or a node that contains a specified value, and is embedded in square brackets. The table below lists some path expressions with predicates and the results of each expression:

Note: subscripts start with 1, not 0.

| Path expression | Description |
| --- | --- |
| /bookstore/book[1] | Selects the first book element under bookstore |
| /bookstore/book[last()] | Selects the last book element under bookstore |
| /bookstore/book[position()<3] | Selects the first two book elements under bookstore |
| //book[@price] | Selects the book elements that have a price attribute |
| //book[@price=10] | Selects all book elements whose price attribute equals 10 |
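The expressions in the table can be tried directly with lxml. This is a sketch using a minimal made-up bookstore document:

```python
from lxml import etree

# Minimal sample document (hypothetical, for illustration only)
xml = """<bookstore>
  <book price="10"><title>First</title></book>
  <book price="20"><title>Second</title></book>
  <book><title>Third</title></book>
</bookstore>"""
root = etree.fromstring(xml)

first = root.xpath("/bookstore/book[1]/title/text()")         # subscripts start at 1
last = root.xpath("/bookstore/book[last()]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")
priced = root.xpath("//book[@price]/title/text()")            # books that have a price attribute
ten = root.xpath("//book[@price=10]/title/text()")            # price equal to 10
print(first, last, first_two, priced, ten)
```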


3. Wildcards

The `*` character is a wildcard.

| Wildcard | Description | Example | Result |
| --- | --- | --- | --- |
| * | Matches any element node | /bookstore/* | Selects all child elements under bookstore |
| @* | Matches any attribute node | //book[@*] | Selects all book elements that have attributes |
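Both wildcards can be checked with lxml on a small made-up document:

```python
from lxml import etree

# Hypothetical sample: two book elements, only the first has an attribute
xml = '<bookstore><book price="10"/><title lang="en"/><book/></bookstore>'
root = etree.fromstring(xml)

children = root.xpath("/bookstore/*")   # any child element under bookstore
with_attrs = root.xpath("//book[@*]")   # book elements having any attribute
print([e.tag for e in children])  # ['book', 'title', 'book']
print(len(with_attrs))            # 1
```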


4. Selecting multiple paths:

You can select several paths at once by using the "|" operator in a path expression. For example:

```
//bookstore/book | //book/title
```
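The "|" operator unions the results of the two expressions, returned in document order. A minimal sketch on a made-up document:

```python
from lxml import etree

# Hypothetical sample document for the union expression
xml = "<bookstore><book><title>A</title></book></bookstore>"
root = etree.fromstring(xml)

# "|" combines the node sets matched by both path expressions
nodes = root.xpath("//bookstore/book | //book/title")
print([n.tag for n in nodes])  # ['book', 'title']
```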


The lxml library

lxml is an HTML/XML parser whose main job is parsing and extracting HTML/XML data.
Like the re module, lxml is implemented in C, making it a high-performance Python HTML/XML parser. We can use the XPath syntax covered above to quickly locate specific elements and node information.
The official lxml documentation covers the details. lxml depends on C libraries and can be installed with pip: pip install lxml


Basic usage:

Parsing HTML from a string

Use etree.HTML to parse a string.
When parsing a string, lxml automatically repairs some missing nodes: unclosed tags are closed, and the body and html nodes are added automatically.
Python source code:

from lxml import etree

# HTML fragment with missing closing tags; etree.HTML will repair it
text = '''
<div class="lg_tbar_l">
            <a href="" class="logo"></a>
            <ul class="lg_tbar_tabs">
                                <li >
                    <a href="" data-lg-tj-id="5i00"
                        data-lg-tj-no="idnull" data-lg-tj-cid="idnull">home page</a>
                <li >
                    <a href="" data-lg-tj-id="5j00"
                        data-lg-tj-no="idnull" data-lg-tj-cid="idnull" data-lg-tj-track-code="index_company">company</a>
                    <a href="" data-lg-tj-id="19xc" data-lg-tj-no="idnull"
                        data-lg-tj-cid="idnull" target="_blank" data-lg-tj-track-code="index_campus">Campus Recruitment
                <li >
                    <a href="" data-lg-tj-id="4s00" data-lg-tj-no="idnull"
                        data-lg-tj-cid="idnull" data-lg-tj-track-code="index_zhaopin">position
                <li >
                    <a href=""
                        data-lg-tj-id="ic00" data-lg-tj-no="idnull" data-lg-tj-cid="idnull"
                        data-lg-tj-track-code="index_yanzhi">Speech duty</a>
                    <a href="" data-lg-tj-id="1mua" data-lg-tj-no="idnull"
                        data-lg-tj-cid="idnull" data-lg-tj-track-code="index_kaiwu" target="_blank">curriculum<span class="tips-new">new</span></a>
                    <a href="" target="_blank">APP</a>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
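A minimal demonstration of the automatic repair mentioned above: etree.HTML closes unclosed tags and wraps the fragment in html and body nodes.

```python
from lxml import etree

fragment = '<li>item one<li>item two'  # two unclosed li tags
tree = etree.HTML(fragment)
result = etree.tostring(tree).decode('utf-8')
# lxml adds <html> and <body> wrappers and closes both <li> tags
print(result)
```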

Parsing an HTML file

from lxml import etree

# etree.parse reads the file directly; pass an HTML parser and its encoding
html = etree.parse('csdn.html', etree.HTMLParser(encoding='utf-8'))
result = etree.tostring(html, encoding='utf-8')
print(result.decode('utf-8'))

Getting the content of a tag (the following exercises use tencent.html; see the end of the article)

Note: to select all li elements, do not add a trailing slash after li, otherwise an error will be raised.

The expression //li returns a list when passed to xpath().

from lxml import etree

html = etree.parse('tencent.html', etree.HTMLParser(encoding='utf-8'))
# select all li elements; xpath() returns a list
html_data = html.xpath('//li')
for i in html_data:
    print(i)

# get the id attribute of every div whose class is "qr-code"
aList = html.xpath('//div[@class="qr-code"]/@id')
for a in aList:
    print(a)


Getting text

There are two methods: one is to select the node that directly contains the text and then call text(); the other is to use //text().
The second method also picks up the whitespace and newline characters produced by code formatting, so the first method is recommended when you want clean results.

# Method 1: select the a node, then take its own text
from lxml import etree

html = etree.parse('tencent.html', etree.HTMLParser(encoding='utf-8'))
html_data = html.xpath('//li[@class="item-1"]/a/text()')
# Method 2: take all descendant text nodes
html_data = html.xpath('//li[@class="item-1"]//text()')
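The difference between the two methods can be seen on a small inline fragment (hypothetical markup, since tencent.html is not reproduced here):

```python
from lxml import etree

# Hypothetical fragment standing in for tencent.html
doc = '''<ul><li class="item-1">
  <a>first item</a>
</li></ul>'''
html = etree.HTML(doc)

direct = html.xpath('//li[@class="item-1"]/a/text()')    # clean
all_text = html.xpath('//li[@class="item-1"]//text()')   # includes whitespace text nodes
print(direct)    # ['first item']
print(all_text)  # also contains the newline/indent text nodes around <a>
```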

# Case: get position information and save it as text

# Calling xpath() on an element searches that element's descendants;
# add a leading dot before // to make the search relative to the current element

aList = html.xpath('//a[@class="recruit-list-link"]')
for a in aList:
    print(a.xpath('.//text()'))
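Since tencent.html is not reproduced here, this is only a hedged sketch of the save-to-text step, using a made-up fragment in place of the real file: each matched link's text is collected via a relative XPath and written out line by line.

```python
from lxml import etree

# Hypothetical stand-in for tencent.html
doc = ('<div><a class="recruit-list-link"><h4>Backend Engineer</h4></a>'
       '<a class="recruit-list-link"><h4>Data Analyst</h4></a></div>')
html = etree.HTML(doc)

lines = []
for a in html.xpath('//a[@class="recruit-list-link"]'):
    # leading dot: search relative to the current element
    texts = a.xpath('.//text()')
    lines.append(' '.join(t.strip() for t in texts if t.strip()))

# save the collected positions, one per line
with open('positions.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines))
print(lines)  # ['Backend Engineer', 'Data Analyst']
```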

Data for exercise

Tags: crawler Python crawler xpath xml

Posted by Archbob on Tue, 19 Apr 2022 07:17:57 +0930