Data Extraction - XPath

# 1. Introduction

BeautifulSoup, covered earlier, is already a very powerful library, but there are other popular parsing libraries as well. One of them is lxml, which supports XPath syntax and is also a relatively efficient way to parse documents. If you are not comfortable with BeautifulSoup, you can try XPath.

Official website: http://lxml.de/index.html

w3school XPath tutorial: http://www.w3school.com.cn/xpath/index.asp

# 2. Install

pip install lxml
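To confirm that the installation worked, a quick check like the following can help. It only imports the package and prints the version tuples that lxml exposes; the actual numbers depend on your environment.

```python
from lxml import etree

# LXML_VERSION and LIBXML_VERSION are tuples; their values depend on the install.
print("lxml version:", etree.LXML_VERSION)
print("libxml2 version:", etree.LIBXML_VERSION)
```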

# 3. XPath syntax

XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on top of XPath expressions.

# 3.1 Node relationship

  • Parent
  • Children
  • Siblings
  • Ancestors
  • Descendants
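These relationships can be explored directly with lxml. The sketch below uses a made-up bookstore document (an assumption for illustration, not part of the original article) and the standard XPath 1.0 axes ancestor::, descendant:: and following-sibling::, which the rest of this article does not cover.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.fromstring(
    "<bookstore>"
    "<book><title>Harry Potter</title><price>29.99</price></book>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)

title = doc.xpath("//title")[0]                  # the first <title> element

parent = title.xpath("..")[0]                    # parent of <title>: <book>
children = parent.xpath("*")                     # children of that <book>
siblings = title.xpath("following-sibling::*")   # siblings that follow <title>
ancestors = title.xpath("ancestor::*")           # <book> and <bookstore>
descendants = doc.xpath("descendant::*")         # every element below <bookstore>

print("parent:", parent.tag)
print("children:", [c.tag for c in children])
print("siblings:", [s.tag for s in siblings])
print("ancestors:", sorted(a.tag for a in ancestors))
print("descendants:", len(descendants))
```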

# 3.2 Select nodes

# 3.2.1 Common path expressions

| Expression | Description |
| --- | --- |
| nodename | Select all child nodes of the current node that are named nodename |
| / | Select from the root node |
| // | Select nodes in the document, starting from the current node, that match the selection, regardless of their position |
| . | Select the current node |
| .. | Select the parent node of the current node |
| @ | Select attributes |
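A minimal sketch of each path expression, run against a small made-up document (the markup and variable names are assumptions chosen for illustration):

```python
from lxml import etree

# Hypothetical sample markup for illustration only.
root = etree.fromstring(
    '<html><body>'
    '<div id="top"><p>one</p></div>'
    '<div id="bottom"><p>two</p><p>three</p></div>'
    '</body></html>'
)

body = root.xpath("body")                # nodename: children of <html> named body
divs_abs = root.xpath("/html/body/div")  # / : absolute path from the root
all_p = root.xpath("//p")                # //: <p> elements at any depth
first_p = all_p[0]
current = first_p.xpath(".")             # . : the current node itself
parent = first_p.xpath("..")             # ..: the parent of the current node
ids = root.xpath("//div/@id")            # @ : attribute values

print([p.text for p in all_p])
print(parent[0].get("id"))
print(ids)
```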

# 3.2.2 Wildcards

XPath wildcards can be used to select unknown XML elements.

| Wildcard | Description | Example | Result |
| --- | --- | --- | --- |
| * | Matches any element node | xpath('div/*') | Get all child elements of the div |
| @* | Matches any attribute node | xpath('div[@*]') | Select all div nodes that have at least one attribute |
| node() | Matches a node of any type | | |
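The difference between the three wildcards is easiest to see side by side. The following sketch assumes a small made-up document in which the first div holds text, elements and a comment:

```python
from lxml import etree

# Hypothetical sample markup for illustration only.
root = etree.fromstring(
    '<body>'
    '<div class="main">text<p>one</p><!-- note --><span>two</span></div>'
    '<div><p>three</p></div>'
    '</body>'
)

child_elements = root.xpath("div/*")     # element children of every div
with_attrs = root.xpath("div[@*]")       # divs that carry at least one attribute
any_nodes = root.xpath("div[1]/node()")  # text, element and comment nodes alike

print([el.tag for el in child_elements])
print(len(with_attrs))
print(len(any_nodes))
```

Note that * skips the text and the comment, while node() returns them together with the elements.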

# 3.2.3 Select several paths

You can select several paths at once by combining them with the "|" operator in a path expression. For example, xpath('//div|//table') returns all div and table nodes.
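As a small sketch of the union operator, assuming a made-up document that mixes div, table and p elements:

```python
from lxml import etree

# Hypothetical sample markup for illustration only.
root = etree.fromstring(
    "<body><div>a</div><table/><p>skip</p><div>b</div></body>"
)

# One query returns the matches of both paths.
both = root.xpath("//div|//table")
print(sorted(el.tag for el in both))
```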

# 3.2.4 Predicates

Predicates are enclosed in square brackets and are used to find a specific node or nodes containing a specified value

| Expression | Result |
| --- | --- |
| xpath('/body/div[1]') | Select the first div node under body |
| xpath('/body/div[last()]') | Select the last div node under body |
| xpath('/body/div[last()-1]') | Select the penultimate div node under body |
| xpath('/body/div[position()<3]') | Select the first two div nodes under body |
| xpath('/body/div[@class]') | Select the div nodes under body that have a class attribute |
| xpath('/body/div[@class="main"]') | Select the div nodes under body whose class attribute is main |
| xpath('/body/div[price>35.00]') | Select the div nodes under body that have a price child element with a value greater than 35 |
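The predicates can be tried out on a small document. The sketch below assumes a made-up XML fragment whose root element is body, so that the absolute /body/div paths apply as written; in a parsed HTML page the root is html, and the paths would start with /html/body or //body instead. The helper prices() is hypothetical and only makes the results easy to read.

```python
from lxml import etree

# Hypothetical fragment with <body> as the root element.
root = etree.fromstring(
    '<body>'
    '<div class="main"><price>20.00</price></div>'
    '<div><price>36.50</price></div>'
    '<div class="side"><price>50.00</price></div>'
    '<div><price>10.00</price></div>'
    '</body>'
)

def prices(expr):
    """Return the price text of every div matched by expr."""
    return [div.findtext("price") for div in root.xpath(expr)]

first = prices("/body/div[1]")                 # ['20.00']
last = prices("/body/div[last()]")             # ['10.00']
penultimate = prices("/body/div[last()-1]")    # ['50.00']
first_two = prices("/body/div[position()<3]")  # ['20.00', '36.50']
has_class = prices("/body/div[@class]")        # ['20.00', '50.00']
main = prices('/body/div[@class="main"]')      # ['20.00']
expensive = prices("/body/div[price>35.00]")   # ['36.50', '50.00']

print(first, last, penultimate, first_two, has_class, main, expensive)
```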

# 3.2.5 XPath operators

The "|" operator computes the union of two node sets: //book | //cd returns a node set containing all book and cd elements. The remaining operators are listed below.

| Operator | Description | Example | Return value |
| --- | --- | --- | --- |
| + | Addition | 6 + 4 | 10 |
| - | Subtraction | 6 - 4 | 2 |
| * | Multiplication | 6 * 4 | 24 |
| div | Division | 8 div 4 | 2 |
| = | Equal to | price=9.80 | Returns true if price is 9.80. Returns false if price is 9.90. |
| != | Not equal to | price!=9.80 | Returns true if price is 9.90. Returns false if price is 9.80. |
| < | Less than | price<9.80 | Returns true if price is 9.00. Returns false if price is 9.90. |
| <= | Less than or equal to | price<=9.80 | Returns true if price is 9.00. Returns false if price is 9.90. |
| > | Greater than | price>9.80 | Returns true if price is 9.90. Returns false if price is 9.80. |
| >= | Greater than or equal to | price>=9.80 | Returns true if price is 9.90. Returns false if price is 9.70. |
| or | Logical or | price=9.80 or price=9.70 | Returns true if price is 9.80. Returns false if price is 9.50. |
| and | Logical and | price>9.00 and price<9.90 | Returns true if price is 9.80. Returns false if price is 8.50. |
| mod | Remainder of division | 5 mod 2 | 1 |
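lxml can evaluate such expressions directly, returning Python floats for numbers and booleans for comparisons. The sketch below assumes a made-up one-element document in which price is 9.80:

```python
from lxml import etree

# Hypothetical document so that "price" resolves to 9.80.
book = etree.fromstring("<book><price>9.80</price></book>")

arithmetic = {
    expr: book.xpath(expr)
    for expr in ("6 + 4", "6 - 4", "6 * 4", "8 div 4", "5 mod 2")
}
comparisons = {
    expr: book.xpath(expr)
    for expr in (
        "price = 9.80",
        "price != 9.80",
        "price < 9.80",
        "price <= 9.80",
        "price > 9.80",
        "price >= 9.80",
        "price = 9.80 or price = 9.70",
        "price > 9.00 and price < 9.90",
    )
}

# Print every expression next to the value lxml returns for it.
for expr, value in {**arithmetic, **comparisons}.items():
    print(f"{expr!r:35} -> {value!r}")
```

Writing the minus sign with surrounding spaces is a safe habit, because a hyphen is also a legal character in element names.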

# 3.3 Usage

# 3.3.1 Small example
from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))  # tostring returns bytes; decode them for readable output

First we import lxml's etree module, then parse the string with etree.HTML, and finally serialize the tree with etree.tostring and print it.

This example shows a very practical feature of lxml: it automatically corrects HTML code. Notice that I deleted the closing tag of the last li element, so it is not closed. Because lxml is built on libxml2, it inherits that library's ability to repair broken HTML.

So the output looks something like this:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>

</body></html>

Not only is the li tag closed, but the body and html tags are added as well.

Reading from a file

In addition to parsing strings directly, lxml can also read content from files. For example, create a new file called hello.html with the following content:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

Use the parse method to read the file

from lxml import etree
html = etree.parse('hello.html')
result = etree.tostring(html, pretty_print=True)
print(result.decode('utf-8'))  # tostring returns bytes; decode them for readable output

The output is the same markup, except that no html and body tags are added, because etree.parse uses an XML parser by default.
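Since etree.parse reads files with an XML parser unless told otherwise, a file with an unclosed tag fails to load. Passing etree.HTMLParser() as the second argument restores the forgiving behaviour seen with etree.HTML. The sketch below assumes a temporary file named broken.html (a made-up name) that it writes itself:

```python
import os
import tempfile
from lxml import etree

# Write a small, deliberately broken HTML file: the <li> is never closed.
broken = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'
path = os.path.join(tempfile.mkdtemp(), "broken.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(broken)

# The default XML parser rejects the mismatched tags.
try:
    etree.parse(path)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser repairs the markup and wraps it in <html><body>.
tree = etree.parse(path, etree.HTMLParser())
print("XML parser succeeded:", xml_ok)
print(etree.tostring(tree, pretty_print=True).decode("utf-8"))
```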

# 3.3.2 Specific use of XPath

The examples below continue with the hello.html file from above.

  1. Get all <li> tags
from lxml import etree
html = etree.parse('hello.html')
print(type(html))
result = html.xpath('//li')
print(result)
print(len(result))
print(type(result))
print(type(result[0]))

Output:

<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>

As the output shows, etree.parse returns an ElementTree. Calling xpath on it returns a list containing the 5 <li> elements, each of type Element.

  2. Get the class attribute of every <li> tag
result = html.xpath('//li/@class')
print(result)

Output:

['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

  3. Get the <a> tag whose href is link1.html under the <li> tag
result = html.xpath('//li/a[@href="link1.html"]')
print(result)

Output:

[<Element a at 0x10ffaae18>]

  4. Get all <span> tags under <li> tags

Note: the following is wrong, because / selects only direct children and <span> is not a direct child of <li>:

result = html.xpath('//li/span')

A double slash is required to reach descendants at any depth:

result = html.xpath('//li//span')
print(result)

Output:

[<Element span at 0x10d698e18>]

  5. Get all class attributes inside the <li> tags, excluding those of the <li> tags themselves
result = html.xpath('//li/a//@class')
print(result)

Output:

['bold']

  6. Get the href of the <a> in the last <li>
result = html.xpath('//li[last()]/a/@href')
print(result)

Output:

['link5.html']

  7. Get the text of the <a> in the second-to-last <li>
result = html.xpath('//li[last()-1]/a')
print(result[0].text)

Output:

fourth item

  8. Get the tag name of the element whose class is bold
result = html.xpath('//*[@class="bold"]')
print(result[0].tag)

Output:

span
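The individual queries above can be combined into a single extraction pass. The following is a minimal sketch, not the only way to do it: it parses the hello.html markup from a string instead of a file so that it runs on its own, and it uses the standard XPath function string(.), which is not covered above, to collect the text of an element together with that of its descendants.

```python
from lxml import etree

# Same markup as hello.html, inlined so the example is self-contained.
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
html = etree.HTML(text)

links = []
for li in html.xpath('//li'):
    a = li.xpath('./a')[0]  # the <a> directly under this <li>
    links.append({
        'class': li.get('class'),
        'href': a.get('href'),
        # string(.) joins all descendant text, so the nested <span> is covered too
        'text': a.xpath('string(.)'),
    })

for link in links:
    print(link)
```

Using string(.) rather than a.text matters for the third item, where the text sits inside a <span> and a.text would be None.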

# Select nodes in the XML file:

  • element (element node)
  • attribute (attribute node)
  • text (text node)
  • concat(element node, element node)
  • comment (comment node)
  • root (root node)
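As a rough sketch of how these node kinds can be selected with lxml, assuming a small made-up menu document: text() and comment() are the standard XPath node tests, and concat() is read here as the XPath string function applied to two element nodes, which is an interpretation of the list item above rather than something the original spells out.

```python
from lxml import etree

# Hypothetical sample document for illustration only.
root = etree.fromstring(
    '<ul id="menu"><!-- main menu --><li>home</li><li>about</li></ul>'
)

elements = root.xpath('//li')                    # element nodes
attributes = root.xpath('/ul/@id')               # attribute nodes
texts = root.xpath('//li/text()')                # text nodes
joined = root.xpath('concat(//li[1], //li[2])')  # string values of two elements, joined
comments = root.xpath('//comment()')             # comment nodes
top = root.xpath('/*')                           # the element directly under the root node

print(len(elements), attributes, texts, joined)
print(comments[0].text, top[0].tag)
```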

Tags: html xml

Posted by ediehl on Thu, 22 Dec 2022 00:21:53 +1030