# 1. Introduction
BeautifulSoup, covered previously, is already a very powerful library, but there are other popular parsing libraries as well. One of them is lxml, which supports XPath syntax and is also a relatively efficient way to parse documents. If you are not comfortable with BeautifulSoup, you can try XPath.

- Official website: http://lxml.de/index.html
- W3School XPath tutorial: http://www.w3school.com.cn/xpath/index.asp
# 2. Install
```shell
pip install lxml
```
# 3. XPath syntax
XPath is a language for finding information in XML documents; it can be used to traverse the elements and attributes of a document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on top of XPath expressions.
# 3.1 Node relationship
- Parent
- Children
- Siblings
- Ancestors
- Descendants
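
The relationships listed above map onto XPath axes. The following is a minimal sketch, assuming a small made-up `bookstore` document (not part of the original example), that queries each relationship from a single `title` element:

```python
from lxml import etree

# Hypothetical sample document, invented for illustration only.
doc = etree.fromstring(
    "<bookstore>"
    "<book><title>A</title><author>X</author><price>10</price></book>"
    "<book><title>B</title><author>Y</author><price>20</price></book>"
    "</bookstore>"
)

# Start from the <title> of the first <book>.
title = doc.xpath("//book[1]/title")[0]

parent = title.xpath("parent::*")[0].tag                           # parent of title
children = [e.tag for e in doc.xpath("//book[1]/child::*")]        # children of the first book
siblings = [e.tag for e in title.xpath("following-sibling::*")]    # siblings after title
ancestors = [e.tag for e in title.xpath("ancestor::*")]            # every enclosing element
descendants = [e.tag for e in doc.xpath("/bookstore/descendant::*")]  # everything below the root

print(parent)
print(children)
print(siblings)
print(ancestors)
print(len(descendants))
```

The shorthand forms used in the rest of this section (`..`, `/`, `//`) are abbreviations of these axes.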
# 3.2 Select nodes
# 3.2.1 Common path expressions
Expression | Description |
---|---|
nodename | Selects all child elements named nodename of the current node |
/ | Selects from the root node |
// | Selects matching nodes anywhere in the document below the current node, regardless of their position |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
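
As a rough illustration of the expressions in the table, the sketch below runs each one against a shortened version of the list markup used later in this section. The variable names are chosen here for illustration.

```python
from lxml import etree

html = etree.HTML('''
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
  </ul>
</div>
''')

root_tag = html.xpath('/html')[0].tag   # "/" starts at the document root
all_li = html.xpath('//li')             # "//" matches at any depth

ul = html.xpath('//ul')[0]
same_ul = ul.xpath('.')[0]              # "." is the context node itself
parent_tag = ul.xpath('..')[0].tag      # ".." is the parent of the context node
child_li = ul.xpath('li')               # a bare name selects child elements with that name
hrefs = html.xpath('//a/@href')         # "@" selects attributes

print(root_tag, len(all_li), parent_tag, len(child_li), hrefs)
```

Note that `ul.xpath(...)` evaluates relative expressions against `ul`, while expressions starting with `/` always start from the document root.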
# 3.2.2 Wildcards
XPath wildcards can be used to select unknown XML elements.
Wildcard | Description | Example | Result |
---|---|---|---|
* | Matches any element node | xpath('div/*') | Selects all child elements of the div |
@* | Matches any attribute node | xpath('div[@*]') | Selects all div nodes that have at least one attribute |
node() | Matches a node of any type | | |
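
To make the difference between `*` and `node()` concrete, here is a small sketch on a made-up fragment (the markup is an assumption, chosen so that the `div` contains text, a comment, and elements):

```python
from lxml import etree

# Hypothetical fragment: one text node, one comment, and three child elements.
root = etree.fromstring(
    '<div id="main">text<p>one</p><!-- note --><span class="x">two</span><b/></div>'
)

# "*" matches element nodes only.
child_elements = [e.tag for e in root.xpath('/div/*')]

# "@*" matches any attribute, so this keeps elements that have at least one.
with_attrs = [e.tag for e in root.xpath('//*[@*]')]

# node() also matches text and comment nodes.
all_nodes = root.xpath('/div/node()')

print(child_elements)
print(with_attrs)
print(len(all_nodes))
```

Here `node()` returns five items (the text, `p`, the comment, `span`, and `b`), while `*` returns only the three elements.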
# 3.2.3 Select several paths
You can select several paths by using the "|" operator in a path expression.
Expression | Result |
---|---|
xpath('//div\|//table') | Selects all div and table nodes |
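
A short sketch of the union operator, using an invented snippet of markup. The result is a single list in document order, not the `div` matches followed by the `table` matches:

```python
from lxml import etree

# Hypothetical markup for illustration.
html = etree.HTML(
    '<div>a</div><table><tr><td>b</td></tr></table><p>c</p><div>d</div>'
)

nodes = html.xpath('//div|//table')
tags = [n.tag for n in nodes]
print(tags)
```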
# 3.2.4 Predicates
Predicates are enclosed in square brackets and are used to find a specific node or nodes that contain a specified value.
Expression | Result |
---|---|
xpath('/body/div[1]') | Selects the first div node under the body |
xpath('/body/div[last()]') | Selects the last div node under the body |
xpath('/body/div[last()-1]') | Selects the penultimate div node under the body |
xpath('/body/div[position()<3]') | Selects the first two div nodes under the body |
xpath('/body/div[@class]') | Selects the div nodes under the body that have a class attribute |
xpath('/body/div[@class="main"]') | Selects the div nodes under the body whose class attribute is main |
xpath('/body/div[price>35.00]') | Selects the div nodes under the body that have a child price element with a value greater than 35.00 |
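
The predicates in the table can be tried out with the sketch below. The document is hypothetical: its root is `body` so that the paths work unchanged, and the `price` children are invented to match the last row. The `positions` helper is also just for illustration; it reports which of the four `div` elements each expression selects.

```python
from lxml import etree

# Hypothetical XML document whose root element is <body>.
root = etree.fromstring('''
<body>
  <div class="main"><price>40.00</price></div>
  <div><price>30.00</price></div>
  <div class="side"><price>36.50</price></div>
  <div><price>10.00</price></div>
</body>
''')

all_divs = root.xpath('/body/div')

def positions(expr):
    """Return the 1-based positions of the divs selected by expr."""
    return [all_divs.index(d) + 1 for d in root.xpath(expr)]

first = positions('/body/div[1]')
last = positions('/body/div[last()]')
penultimate = positions('/body/div[last()-1]')
first_two = positions('/body/div[position()<3]')
has_class = positions('/body/div[@class]')
main_only = positions('/body/div[@class="main"]')
expensive = positions('/body/div[price>35.00]')

print(first, last, penultimate, first_two, has_class, main_only, expensive)
```

With a tree built by `etree.HTML`, the root element is `html`, so the equivalent absolute paths would start with `/html/body/div`.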
# 3.2.5 XPath operators
Operator | Description | Example | Return value |
---|---|---|---|
\| | Computes the union of two node sets | //book \| //cd | Returns a node set with all book and cd elements |
+ | Addition | 6 + 4 | 10 |
- | Subtraction | 6 - 4 | 2 |
* | Multiplication | 6 * 4 | 24 |
div | Division | 8 div 4 | 2 |
= | Equal to | price=9.80 | Returns true if price is 9.80. Returns false if price is 9.90. |
!= | Not equal to | price!=9.80 | Returns true if price is 9.90. Returns false if price is 9.80. |
< | Less than | price<9.80 | Returns true if price is 9.00. Returns false if price is 9.90. |
<= | Less than or equal to | price<=9.80 | Returns true if price is 9.00. Returns false if price is 9.90. |
> | Greater than | price>9.80 | Returns true if price is 9.90. Returns false if price is 9.80. |
>= | Greater than or equal to | price>=9.80 | Returns true if price is 9.90. Returns false if price is 9.70. |
or | Logical or | price=9.80 or price=9.70 | Returns true if price is 9.80. Returns false if price is 9.50. |
and | Logical and | price>9.00 and price<9.90 | Returns true if price is 9.80. Returns false if price is 8.50. |
mod | Remainder of division | 5 mod 2 | 1 |
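
XPath expressions do not have to return nodes. As a minimal sketch, assuming a one-element `book` document invented for this example, lxml returns arithmetic results as Python floats and comparisons as Python booleans:

```python
from lxml import etree

# Hypothetical document with a single price of 9.80.
book = etree.fromstring('<book><price>9.80</price></book>')

total = book.xpath('6 + 4')          # numbers come back as floats
difference = book.xpath('6 - 4')
quotient = book.xpath('8 div 4')
remainder = book.xpath('5 mod 2')

is_equal = book.xpath('price=9.80')                    # compares the <price> child
in_range = book.xpath('price>9.00 and price<9.90')
either = book.xpath('price=9.70 or price=9.50')

print(total, difference, quotient, remainder)
print(is_equal, in_range, either)
```

In practice these operators are used mostly inside predicates, for example `//book[price>9.00 and price<9.90]`.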
# 3.3 Usage
# 3.3.1 Small example
```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
</div>
'''
html = etree.HTML(text)          # parse the string into an element tree
result = etree.tostring(html)    # serialize the tree back to bytes
print(result.decode('utf-8'))
```
First we import lxml's etree module, then build a tree with etree.HTML, and finally serialize it with etree.tostring and print it.

This shows a very practical feature of lxml: it automatically corrects HTML. Note that the last li tag in the input has no closing tag. Because lxml is built on libxml2, it inherits libxml2's ability to repair such markup.

So the output looks something like this:
```html
<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body></html>
```
Not only is the li tag completed, but body and html tags are also added.

# Reading from a file

In addition to reading strings directly, lxml also supports reading content from files. For example, we create a new file called hello.html with the following content:
```html
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
```
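Use the parse method to read the file:

```python
from lxml import etree

html = etree.parse('hello.html')                    # returns an ElementTree
result = etree.tostring(html, pretty_print=True)    # serialize to bytes
print(result.decode('utf-8'))
```

This prints the same list markup. Note that etree.parse uses an XML parser by default, so it neither adds html and body tags nor repairs broken markup; hello.html parses only because it is well formed.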
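
For files that are not well formed, `etree.parse` accepts a parser as its second argument. The sketch below is one way to see the difference. The file name `broken.html` and its content are made up for the example, and the file is written to a temporary directory so the snippet is self-contained:

```python
import os
import tempfile

from lxml import etree

# Hypothetical broken markup: the <li> is never closed.
broken = '<div><ul><li class="item-0"><a href="link5.html">fifth item</a></ul></div>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'broken.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(broken)

    # The default XML parser rejects the mismatched tags.
    try:
        etree.parse(path)
        xml_ok = True
    except etree.XMLSyntaxError:
        xml_ok = False

    # The HTML parser repairs the markup, as etree.HTML does for strings.
    tree = etree.parse(path, etree.HTMLParser())

root_tag = tree.getroot().tag
li_count = len(tree.xpath('//li'))
print(xml_ok, root_tag, li_count)
```

Passing `etree.HTMLParser()` is usually the safer choice when the file comes from a real web page.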
# 3.3.2 Specific use of XPath
We continue with the program above, which parses hello.html, as the example.
- Get all <li> tags

```python
from lxml import etree

html = etree.parse('hello.html')
print(type(html))
result = html.xpath('//li')
print(result)
print(len(result))
print(type(result))
print(type(result[0]))
```

Output:

```
<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>
```

As you can see, etree.parse returns an ElementTree. Calling xpath on it returns a list containing the 5 <li> elements, each of which is of type Element.
- Get the class attribute of every <li> tag

```python
result = html.xpath('//li/@class')
print(result)
```

Output:

```
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
```
- Get the <a> tag whose href is link1.html under the <li> tag

```python
result = html.xpath('//li/a[@href="link1.html"]')
print(result)
```

Output:

```
[<Element a at 0x10ffaae18>]
```
- Get all <span> tags under <li> tags

Note that the first expression below is wrong:

```python
result = html.xpath('//li/span')
# / selects only direct children, and <span> is not a direct child of <li>,
# so a double slash is required
result = html.xpath('//li//span')
print(result)
```

Output:

```
[<Element span at 0x10d698e18>]
```
- Get every class attribute below the <li> tags, excluding those of the <li> tags themselves

```python
result = html.xpath('//li/a//@class')
print(result)
```

Output:

```
['bold']
```
- Get the href of the <a> in the last <li>

```python
result = html.xpath('//li[last()]/a/@href')
print(result)
```

Output:

```
['link5.html']
```
- Get the text of the <a> in the second-to-last <li>

```python
result = html.xpath('//li[last()-1]/a')
print(result[0].text)
```

Output:

```
fourth item
```
- Get the tag name of the element whose class is bold

```python
result = html.xpath('//*[@class="bold"]')
print(result[0].tag)
```

Output:

```
span
```
# Select nodes in the XML file:
- element (element node)
- attribute (attribute node)
- text (text node)
- concat(element node, element node)
- comment (comment node)
- root (root node)
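
The sketch below selects one node of each kind from a small made-up fragment. It treats `concat` as the standard XPath string function `concat()`, which is an assumption about what the list item above refers to.

```python
from lxml import etree

# Hypothetical fragment containing a comment, an element, an attribute, and text.
root = etree.fromstring(
    '<ul><!-- menu --><li class="item-0"><a href="link1.html">first item</a></li></ul>'
)

elements = root.xpath('//a')             # element nodes
attrs = root.xpath('//a/@href')          # attribute nodes, returned as strings
texts = root.xpath('//a/text()')         # text nodes, returned as strings
comments = root.xpath('//comment()')     # comment nodes
joined = root.xpath('concat(//a/@href, " -> ", //a/text())')  # a single string
root_tag = root.xpath('/*')[0].tag       # the root element

print(elements[0].tag, attrs, texts)
print(comments[0].text)
print(joined)
print(root_tag)
```

In lxml, attribute and text results behave like ordinary Python strings, so they can be used directly without reading `.text` or `.get()`.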