Crawler data extraction with XPath and lxml: essentials for beginners

1. What is XPath, and what is it for?

XPath (XML Path Language) is a language for finding information in XML and HTML documents. It can be used to traverse the elements and attributes of a document and to locate a specific part of it.

At present, the major browsers have XPath extensions available:

  1. Chrome plugin XPath Helper.
  2. Firefox plugin Try XPath.

Installing the plug-in

Chrome extensions normally have to be downloaded from behind the firewall, so here is an alternative installation method, shown below.

1. First download the XPath Helper plug-in from the link below.
Extraction code: a1dv
2. After downloading the plug-in, unzip it, then find the 2.0.2_0.crx file in the extracted folder and change its suffix from .crx to .rar, as shown below.

3. After decompressing it, open the Chrome extensions page, enable developer mode, and load the unpacked extension, as shown below.

4. Usage

Restart the browser and click the XPath Helper icon in the upper right corner, as shown in the figure below.


1. Selecting nodes:

XPath uses path expressions to select nodes or node sets in an XML document. These path expressions are very similar to the ones used in a conventional computer file system.
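As a quick illustration of the two basic forms (`/` selects from the root, `//` selects anywhere in the document), here is a small sketch using lxml, which is introduced later in this article, on a tiny made-up XML snippet:

```python
from lxml import etree

# A tiny sample document (hypothetical, for illustration only)
xml = "<bookstore><book><title>A</title></book><book><title>B</title></book></bookstore>"
root = etree.fromstring(xml)

# "/" selects along an absolute path from the root;
# "//" selects matching nodes anywhere in the document
titles_abs = root.xpath("/bookstore/book/title/text()")
titles_any = root.xpath("//title/text()")
print(titles_abs)  # ['A', 'B']
print(titles_any)  # ['A', 'B']
```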

2. Predicates:

A predicate is used to find a specific node, or a node that contains a specified value, and is embedded in square brackets. The table below lists some path expressions with predicates and the results of each expression:

Note: subscripts start with 1, not 0.

| Path expression | Description |
| --- | --- |
| /bookstore/book[1] | Selects the first book element under bookstore |
| /bookstore/book[last()] | Selects the last book element under bookstore |
| /bookstore/book[position()<3] | Selects the first two book elements under bookstore |
| //book[@price] | Selects the book elements that have a price attribute |
| //book[@price=10] | Selects all book elements whose price attribute equals 10 |
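The expressions in the table can be tried directly with lxml. This is a sketch using a minimal made-up bookstore document:

```python
from lxml import etree

# Minimal sample document (hypothetical, for illustration only)
xml = """<bookstore>
  <book price="10"><title>First</title></book>
  <book price="20"><title>Second</title></book>
  <book><title>Third</title></book>
</bookstore>"""
root = etree.fromstring(xml)

first = root.xpath("/bookstore/book[1]/title/text()")         # subscripts start at 1
last = root.xpath("/bookstore/book[last()]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")
priced = root.xpath("//book[@price]/title/text()")            # books that have a price attribute
ten = root.xpath("//book[@price=10]/title/text()")            # price equal to 10
print(first, last, first_two, priced, ten)
```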


3. Wildcards

The `*` character is a wildcard.

| Wildcard | Description | Example | Result |
| --- | --- | --- | --- |
| * | Matches any element node | /bookstore/* | Selects all child elements under bookstore |
| @* | Matches any attribute node | //book[@*] | Selects all book elements that have attributes |
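Both wildcards can be checked with lxml on a small made-up document:

```python
from lxml import etree

# Hypothetical sample: two book elements, only the first has an attribute
xml = '<bookstore><book price="10"/><title lang="en"/><book/></bookstore>'
root = etree.fromstring(xml)

children = root.xpath("/bookstore/*")   # any child element under bookstore
with_attrs = root.xpath("//book[@*]")   # book elements having any attribute
print([e.tag for e in children])  # ['book', 'title', 'book']
print(len(with_attrs))            # 1
```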


4. Selecting multiple paths:

You can select several paths at once by using the "|" operator in a path expression. For example:

```
//bookstore/book | //book/title
```
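The "|" operator unions the results of the two expressions, returned in document order. A minimal sketch on a made-up document:

```python
from lxml import etree

# Hypothetical sample document for the union expression
xml = "<bookstore><book><title>A</title></book></bookstore>"
root = etree.fromstring(xml)

# "|" combines the node sets matched by both path expressions
nodes = root.xpath("//bookstore/book | //book/title")
print([n.tag for n in nodes])  # ['book', 'title']
```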


The lxml library

lxml is an HTML/XML parser whose main job is parsing and extracting HTML/XML data.
Like the re module, lxml is implemented in C, making it a high-performance Python HTML/XML parser. We can use the XPath syntax covered above to quickly locate specific elements and node information.
The official lxml documentation covers the details. lxml depends on C libraries and can be installed with pip: pip install lxml


Basic usage:

Parsing HTML from a string

Use etree.HTML to parse a string.
When parsing a string, lxml automatically repairs some missing nodes: unclosed tags are closed, and the body and html nodes are added automatically.
Python source code:

from lxml import etree

# HTML fragment with missing closing tags; etree.HTML will repair it
text = '''
<div class="lg_tbar_l">
            <a href="" class="logo"></a>
            <ul class="lg_tbar_tabs">
                                <li >
                    <a href="" data-lg-tj-id="5i00"
                        data-lg-tj-no="idnull" data-lg-tj-cid="idnull">home page</a>
                <li >
                    <a href="" data-lg-tj-id="5j00"
                        data-lg-tj-no="idnull" data-lg-tj-cid="idnull" data-lg-tj-track-code="index_company">company</a>
                    <a href="" data-lg-tj-id="19xc" data-lg-tj-no="idnull"
                        data-lg-tj-cid="idnull" target="_blank" data-lg-tj-track-code="index_campus">Campus Recruitment
                <li >
                    <a href="" data-lg-tj-id="4s00" data-lg-tj-no="idnull"
                        data-lg-tj-cid="idnull" data-lg-tj-track-code="index_zhaopin">position
                <li >
                    <a href=""
                        data-lg-tj-id="ic00" data-lg-tj-no="idnull" data-lg-tj-cid="idnull"
                        data-lg-tj-track-code="index_yanzhi">Speech duty</a>
                    <a href="" data-lg-tj-id="1mua" data-lg-tj-no="idnull"
                        data-lg-tj-cid="idnull" data-lg-tj-track-code="index_kaiwu" target="_blank">curriculum<span class="tips-new">new</span></a>
                    <a href="" target="_blank">APP</a>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
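A minimal demonstration of the automatic repair mentioned above: etree.HTML closes unclosed tags and wraps the fragment in html and body nodes.

```python
from lxml import etree

fragment = '<li>item one<li>item two'  # two unclosed li tags
tree = etree.HTML(fragment)
result = etree.tostring(tree).decode('utf-8')
# lxml adds <html> and <body> wrappers and closes both <li> tags
print(result)
```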

Parsing an HTML file

from lxml import etree

# etree.parse reads the file directly; pass an HTML parser and its encoding
html = etree.parse('csdn.html', etree.HTMLParser(encoding='utf-8'))
result = etree.tostring(html, encoding='utf-8')
print(result.decode('utf-8'))

Getting the content of a tag (the following exercises use tencent.html; see the end of the article)

Note: to select all li elements, do not add a trailing slash after li, otherwise an error will be raised.

The expression //li returns a list when passed to xpath().

from lxml import etree

html = etree.parse('tencent.html', etree.HTMLParser(encoding='utf-8'))
# select all li elements; xpath() returns a list
html_data = html.xpath('//li')
for i in html_data:
    print(i)

# get the id attribute of every div whose class is "qr-code"
aList = html.xpath('//div[@class="qr-code"]/@id')
for a in aList:
    print(a)


Getting text

There are two methods: one is to select the node that directly contains the text and then call text(); the other is to use //text().
The second method also picks up the whitespace and newline characters produced by code formatting, so the first method is recommended when you want clean results.

# Method 1: select the a node, then take its own text
from lxml import etree

html = etree.parse('tencent.html', etree.HTMLParser(encoding='utf-8'))
html_data = html.xpath('//li[@class="item-1"]/a/text()')
# Method 2: take all descendant text nodes
html_data = html.xpath('//li[@class="item-1"]//text()')
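The difference between the two methods can be seen on a small inline fragment (hypothetical markup, since tencent.html is not reproduced here):

```python
from lxml import etree

# Hypothetical fragment standing in for tencent.html
doc = '''<ul><li class="item-1">
  <a>first item</a>
</li></ul>'''
html = etree.HTML(doc)

direct = html.xpath('//li[@class="item-1"]/a/text()')    # clean
all_text = html.xpath('//li[@class="item-1"]//text()')   # includes whitespace text nodes
print(direct)    # ['first item']
print(all_text)  # also contains the newline/indent text nodes around <a>
```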

# Case: get position information and save it as text

# Calling xpath() on an element searches that element's descendants;
# add a leading dot before // to make the search relative to the current element

aList = html.xpath('//a[@class="recruit-list-link"]')
for a in aList:
    print(a.xpath('.//text()'))
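Since tencent.html is not reproduced here, this is only a hedged sketch of the save-to-text step, using a made-up fragment in place of the real file: each matched link's text is collected via a relative XPath and written out line by line.

```python
from lxml import etree

# Hypothetical stand-in for tencent.html
doc = ('<div><a class="recruit-list-link"><h4>Backend Engineer</h4></a>'
       '<a class="recruit-list-link"><h4>Data Analyst</h4></a></div>')
html = etree.HTML(doc)

lines = []
for a in html.xpath('//a[@class="recruit-list-link"]'):
    # leading dot: search relative to the current element
    texts = a.xpath('.//text()')
    lines.append(' '.join(t.strip() for t in texts if t.strip()))

# save the collected positions, one per line
with open('positions.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines))
print(lines)  # ['Backend Engineer', 'Data Analyst']
```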

Data for exercise

Tags: crawler Python crawler xpath xml

Posted by Archbob on Tue, 19 Apr 2022 07:17:57 +0930