Hello, I'm Xiaoshuai!
Today I'll walk you through one of the simplest crawler examples. Going forward I'll keep posting crawler-related tips, so stay tuned. Your likes, favorites, and shares are the biggest support for Xiaoshuai. One note up front: all the examples Xiaoshuai shares are for learning only. Please don't misuse them or use them commercially!
Preface
Our first crawler program will scrape the names of all the courses on a video tutorial site. Some of the techniques used in the code below have not been covered yet; we will explain them one by one later in the course. The goal here is simply to give you a general picture of a crawler program, make you familiar with the most basic crawling workflow, and leave you with a rough impression of how a crawler processes data. It should also spark your enthusiasm, so that your understanding of crawlers comes not only from theory but also from practice.
1.1 Create the imoocspider.py file
Crawler files should be named precisely, conventionally after the website they crawl. As we write more and more crawlers later on, this convention makes them much easier to manage.
After the file is created, first import the third-party requests library and the page-parsing tool BeautifulSoup (if they are not installed yet, `pip install requests beautifulsoup4` will fetch both):
```python
import requests                # Requests library, used to send network requests
from bs4 import BeautifulSoup  # parsing library, used to parse the web page structure
```
Tip: We will cover BeautifulSoup in detail later; for now, this is just a first look.
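If this is your first encounter with BeautifulSoup, here is a minimal, self-contained sketch of what it does (the HTML string is made up purely for illustration):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet, just to show what parsing looks like
html = "<html><body><h3 class='course-card-name'>Python Basics</h3></body></html>"

soup = BeautifulSoup(html, "html.parser")  # build a parse tree from the string
print(soup.find("h3").text)                # prints: Python Basics
```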
1.2 Define the URL variable
Define the variable url, which stores the address of the website we want to crawl:
url = "https://www.imooc.com "# a course website homepage address
1.3 Create the request header
Next, create a request header. Servers can often tell whether a request comes from a browser or from a crawler, and may drop crawler requests outright, causing them to fail. To keep our crawler from being exposed, we give it a layer of disguise so the server believes a browser is making the request:
```python
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36'}  # request header, pretending to be a Chrome browser
```
1.4 Send the request
Use the get method in the requests library to make a request:
```python
r = requests.get(url, headers=headers)  # send the request
```
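Since a server may reject a request it identifies as a crawler, it is worth confirming the request actually succeeded before parsing. A minimal sketch using the standard requests API:

```python
# Optional sanity check before parsing
print(r.status_code)   # 200 means the request succeeded
r.raise_for_status()   # raises an HTTPError for 4xx/5xx responses
```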
1.5 Parse the response
Because the response is in HTML format, we use BeautifulSoup to parse it:
```python
bs = BeautifulSoup(r.text, "html.parser")  # parse the web page
```
In the returned page, all the data we want is wrapped in h3 tags, so we use BeautifulSoup to find every h3 tag with the class course-card-name and store the matches in the variable mooc_classes:
```python
mooc_classes = bs.find_all("h3", class_="course-card-name")  # locate course information
```
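find_all returns a list of Tag objects. If you are unsure whether the selector matched anything, you can peek at the result first (the output shown here is hypothetical and depends on the live page):

```python
# Quick sanity check on what find_all returned
print(len(mooc_classes))                 # number of matched h3 tags
if mooc_classes:                         # guard against an empty match
    print(mooc_classes[0].text.strip())  # e.g. the first course name
```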
1.6 Extract the data
Strip the course name out of each h3 tag, collect the names in class_list, and finally save the course information to a text file:
```python
class_list = []
for i in range(len(mooc_classes)):
    title = mooc_classes[i].text.strip()
    class_list.append("Course name : {}\n".format(title))  # format the course information

with open('mooc_classes.txt', "a+") as f:  # write the course information to a text file
    for text in class_list:
        f.write(text)
```
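As a side note, the same step can be written more compactly with a list comprehension and writelines. This is an equivalent sketch, not a required change:

```python
# Equivalent, more compact version of the extraction-and-save step
class_list = ["Course name : {}\n".format(tag.text.strip()) for tag in mooc_classes]
with open('mooc_classes.txt', "a+") as f:
    f.writelines(class_list)  # writelines writes each string as-is
```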
1.7 Final code
Here is the final code of our little crawler:
```python
import requests                # Requests library, used to send network requests
from bs4 import BeautifulSoup  # parsing library, used to parse the web page structure

url = "https://www.imooc.com"  # course website homepage address
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36'}  # request header

r = requests.get(url, headers=headers)     # send the request
bs = BeautifulSoup(r.text, "html.parser")  # parse the web page
mooc_classes = bs.find_all("h3", class_="course-card-name")  # locate course information

class_list = []
for i in range(len(mooc_classes)):
    title = mooc_classes[i].text.strip()
    class_list.append("Course name : {}\n".format(title))  # format the course information

with open('mooc_classes.txt', "a+") as f:  # write the course information to a text file
    for text in class_list:
        f.write(text)
```
**The above program is the simplest crawler program.** We format the output so that each line reads "Course name : " followed by the course title, and save the results to a txt file. Finally, let's open the txt file to see the effect.
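If you would rather verify the result from Python than open the file by hand, a quick read-back works (the actual course names will depend on the live page):

```python
# Read the saved file back to verify the output
with open('mooc_classes.txt') as f:
    print(f.read())
```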
Summary
In this section we implemented the simplest possible crawler using BeautifulSoup and Requests; we will study both in detail in the following chapters. This small program is only meant to show you the most basic workflow of a crawler. Observant students will have noticed how little code was needed to implement a simple scraping job. In fact, it is precisely because Python makes writing crawlers so simple and convenient that it has become the language of choice for crawler programs.