Python crawler web page element location techniques, all in this blog post

📢📢📢📢📢📢
Hello! Hello! I'm [Dream Eraser], a developer with 10 years of R&D experience, committed to spreading the Python technology stack 💗
🌻 If you think this article is good, give it a thumbs-up with your little hand 👍
🌻 If you find any errors in this article, please point them out in the comment section 💗
👍 I blog about technology every day because I love writing; if any article helps you, that's wonderful~ 👍
📣📣📣📣📣📣

Welcome to subscribe to the column ⭐⭐ Python Crawler 120 ⭐⭐

📆 Latest update: April 3, 2022, the 607th original blog post by Eraser

⛳ The practical scenario

As a Python crawler beginner, nine times out of ten what you collect will be web pages, so quickly locating content within a page is the first obstacle you will face. This blog explains in detail the easiest-to-use web page element location technique, and one read-through will get you past it.

Recently I have been adding a series of posts on Python crawler basics. I hope you can keep up with the rhythm: cue the music, and dance.

The core of this article is the Beautiful Soup module, so the site we use for testing and collecting is its official website (at this stage anti-crawling is getting stricter and stricter: many sites cannot be collected and it is easy to get blocked, so we can only collect from whatever we are learning).

Official Site

www.crummy.com/software/BeautifulSoup/

Beautiful Soup is well known and easy to use in Python crawler circles. It is a Python parsing library, mainly used to convert HTML tags into a Python object tree, from which we then extract data.

Installing the module is extremely simple:

pip install bs4 -i <any domestic mirror source>

When installing any module in the future, try to use a domestic mirror source; it is fast and stable.
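For example, using the Tsinghua mirror (just one common domestic source; any mirror will do):

pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple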

The package name to install is bs4, which deserves special attention.

🥇 The basic usage is as follows

import requests
from bs4 import BeautifulSoup


def ret_html():
    """Fetch the HTML source of the target page."""
    res = requests.get('https://www.crummy.com/software/BeautifulSoup/', timeout=3)
    return res.text


if __name__ == '__main__':
    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    print(soup)

What needs attention is the module import statement and the two parameters passed to the BeautifulSoup class constructor when instantiating the soup object: one is the string to be parsed, the other is the parser to use. The official documentation recommends lxml because it parses quickly.

The output of the above code is as follows; it looks like an ordinary HTML source file.

We can also call the soup object's prettify() method to format the HTML tags, so the HTML code looks tidy when saved to an external file.
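For example, a minimal sketch (the output file name soup.html is just an illustration, and the lxml parser from earlier is assumed to be installed):

from bs4 import BeautifulSoup

html_str = '<html><body><h1>Beautiful Soup</h1></body></html>'
soup = BeautifulSoup(html_str, 'lxml')
print(soup.prettify())  # prettify() returns the HTML as a neatly indented string

# save the formatted HTML to an external file
with open('soup.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())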

⛳ Objects of the BeautifulSoup module

The BeautifulSoup class parses HTML text into a Python object tree, which includes the four most important kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. We will introduce them one by one.

🥇 BeautifulSoup object

The object itself represents the whole HTML page, and the HTML code is automatically completed (for example, missing tags are filled in) when the object is instantiated.

    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    print(type(soup))
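The printed type is the BeautifulSoup class itself:

<class 'bs4.BeautifulSoup'>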

🥇 Tag object

A Tag object is a web page tag, that is, a web page element object. For example, let's get the h1 tag object of the bs4 official website. The code is as follows:

if __name__ == '__main__':
    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    # print(soup.prettify())  # Format HTML

    print(soup.h1)

What you get is exactly the h1 tag from the web page:

<h1>Beautiful Soup</h1>

Use the type function in Python to view its type. The code is as follows:

    print(soup.h1)
    print(type(soup.h1))

Instead of a string, you get a Tag object.

<h1>Beautiful Soup</h1>
<class 'bs4.element.Tag'>

Since it is a Tag object, it has some specific attributes and values.

Get the tag name:

    print(soup.h1)
    print(type(soup.h1))
    print(soup.h1.name)  # Get tag name

Get a tag's attribute value through the Tag object:

    print(soup.img)  # Get the first img tag of the web page
    print(soup.img['src'])  # Get the value of the tag's src attribute, dictionary style

Get all attributes of the tag through its attrs attribute:

    print(soup.img)  # Get the first img tag of the web page

    print(soup.img.attrs)  # Get all attribute values of web page elements and return them in dictionary form

All the outputs of the above code are as follows. You can pick any tag to practice with.

<h1>Beautiful Soup</h1>
<class 'bs4.element.Tag'>
h1
<img align="right" src="10.1.jpg" width="250"/>
{'align': 'right', 'src': '10.1.jpg', 'width': '250'}
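As a side note, when an attribute may be missing, the Tag object also supports the dictionary-style get() method, which returns None instead of raising a KeyError. A minimal sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<img align="right" src="10.1.jpg" width="250"/>', 'lxml')
print(soup.img.get('src'))  # 10.1.jpg
print(soup.img.get('alt'))  # None, no exception is raised
# soup.img['alt'] would raise a KeyError instead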

🥇 NavigableString object

A NavigableString object holds the text content inside a tag, for example the p tag in the following snippet, whose content is "I'm an eraser":

<p>I'm an eraser</p>

It's also very easy to get the object. Just use the string attribute of the Tag object.

    nav_obj = soup.h1.string
    print(type(nav_obj))

The output results are as follows

<class 'bs4.element.NavigableString'>

If the target tag is a self-closing tag, or contains more than one child, you will get None instead.
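A quick check of both cases with a hand-written snippet (a minimal sketch):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>I am an eraser</p><br/></div>', 'lxml')
print(soup.p.string)    # I am an eraser
print(soup.br.string)   # None: a self-closing tag has no text child
print(soup.div.string)  # None: the div contains more than one child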

In addition to the string attribute of the object, you can also use the text attribute and the get_text() method to get the tag content.

    print(soup.h1.text)
    print(soup.p.get_text())
    print(soup.p.get_text('&'))

Here text gets the combined string of the contents of all child tags; get_text() has the same effect, except that get_text() can take a separator, such as the & symbol in the code above, and also a strip=True parameter to strip whitespace.
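A small demonstration of the difference, using a hand-written snippet (a minimal sketch):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p> I am <b>an</b> eraser </p>', 'lxml')
print(repr(soup.p.text))                       # ' I am an eraser '
print(repr(soup.p.get_text('&')))              # ' I am &an& eraser '
print(repr(soup.p.get_text('&', strip=True)))  # 'I am&an&eraser'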

🥇 Comment object

It gets the content of HTML comments in the page. It is of little use in practice, so you can mostly ignore it.
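For completeness, a minimal sketch: when a tag's only content is an HTML comment, the string attribute returns a Comment object rather than a plain NavigableString.

from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup('<b><!-- I am a comment --></b>', 'lxml')
print(soup.b.string)                       # I am a comment
print(type(soup.b.string))                 # <class 'bs4.element.Comment'>
print(isinstance(soup.b.string, Comment))  # True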

Both BeautifulSoup objects and Tag objects support tag lookup methods, as shown below.

⛳ find() method and find_all() method

Call the find() method of a BeautifulSoup object or a Tag object to find a specified object in the web page. The syntax format of the method is as follows:

obj.find(name, attrs, recursive, text, **kwargs)

The method returns the first element found; if nothing is found, it returns None.
The parameters are described as follows:

  • name: the tag name;
  • attrs: tag attributes;
  • recursive: whether to search all descendant elements (defaults to True);
  • text: the tag's text content.

For example, we continue to find the a tag in the web page requested above. The code is as follows:

html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
print(soup.find('a'))

You can also use the attrs parameter to search. The code is as follows:

html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
# print(soup.find('a'))
print(soup.find(attrs={'class': 'cta'}))

The find() method also provides some special parameters for direct searching. For example, you can use id=xxx to find tags with a matching id attribute, and class_=xxx to find tags with a matching class attribute (note the trailing underscore, since class is a Python keyword).

print(soup.find(class_='cta'))
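Searching by id works the same way; the id value below is hypothetical, purely to show the syntax:

print(soup.find(id='download'))  # find the tag whose id attribute equals 'download'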

Paired with the find() method is the find_all() method; as the name suggests, it returns all matching tags. The syntax format is as follows:

obj.find_all(name, attrs, recursive, text, limit)

Note in particular the limit parameter, which indicates the maximum number of matches to return. The find() method can be regarded as find_all() with limit=1, which makes it easy to understand.
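A short sketch against the same page (a hedged example; the actual results depend on the live page):

html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
print(len(soup.find_all('a')))      # how many a tags the page contains
print(soup.find_all('a', limit=3))  # return at most three matches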

📣📣📣📣📣📣
🌻 If you find any errors in this article, please point them out in the comment section 💗

Welcome to subscribe to the column ⭐⭐ Python Crawler 120 ⭐⭐
