Using python to crawl microblog comment data, the road of crawler is endless.. (source code attached)

Using python to crawl microblog comment data, the road of crawler is endless.. (source code attached)

Today, the target crawls the comment information of any blog post on Weibo

Tool use

Development environment: win10, python3 six
Development tool: pycharm
Toolkit: requests, re, time, random, tkinter

Analysis of project ideas

1. Log in to the web version and retrieve the cookie value
2. Select the web site of the blog comment information to climb
https://weibo.com/3167104922/Kkl7ar83T#comment take as an example
3. Retrieve the unique ID value (weibo_id) of the blog according to the address of the web version
4. Construct the address of the comment request of the blog mobile version
f'https://m.weibo.cn/comments/hotflow?id={weibo_id}&mid={weibo_id}&max_id_type=0'
5. Send the request to retrieve the json data of the response
6.max_id and Max_ id_ The value of type is determined
7. Construct the data parameter, and add the parameter in the next page turning request

"""structure GET Request parameters"""
        data = {
            'id': weibo_id,
            'mid': weibo_id,
            'max_id': max_id,
            'max_id_type': max_id_type
        }

8.max_id is the page turning rule of the previous package
9. Then continue to analyze the data, obtain the content of the comment information, and then turn the page and call back all the time.

Log in at the starting address first

https://weibo.com/
After logging in, click to open a blog post, click comments, and click to view more comments
In this paper https://weibo.com/3167104922/Kkl7ar83T#comment take as an example

search weibo_id, construct the requests url in the headers, send the request and get Weibo back_ Value of ID

Enter the mobile version mode for XHR packet capture


data parameters for page turning

"""structure GET Request parameters"""
        data = {
            'id': weibo_id,
            'mid': weibo_id,
            'max_id': max_id,
            'max_id_type': max_id_type
        }

Let's start with a code operation

This code needs to prepare the cookie and blog page address after login

It's embarrassing to report a mistake... Grab only one page of comment data

I continued to drop down the comments directly. On the second page, I needed to log in directly. I had just logged in, and the web page was not closed, so I was asked to log in. Therefore, I need a cookie pool.
Later, it was found that the problem lies in that the crawling of microblog comments requires multiple accounts and multiple ua, and the restrictions have changed

Then I used this random sampling method to try to break through



It still failed...
I refer to the usage of cookie pool on the Internet later. I feel confused. Hey hey
You can refer to it, https://www.cnblogs.com/ciquankun/p/13329252.html

Of course, you guys have a good way to tell me secretly

Source code display:

from datetime import datetime
from requests_html import HTMLSession
import re, time
import tkinter as tk
session = HTMLSession()


class WBSpider(object):

    def __init__(self):

        """Define the visualization window and set the size and layout of the window and theme"""
        self.window = tk.Tk()
        self.window.title('Microblog comment information collection')
        self.window.geometry('800x600')

        """establish label_user Buttons, and instructions"""
        self.label_user = tk.Label(self.window, text='Please enter the address of the microblog comment you want to crawl:', font=('Arial', 12), width=30, height=2)
        self.label_user.pack()
        """establish label_user Associated input"""
        self.entry_user = tk.Entry(self.window, show=None, font=('Arial', 14))
        self.entry_user.pack(after=self.label_user)

        """establish label_passwd Buttons, and instructions"""
        self.label_passwd = tk.Label(self.window, text="Please enter the password after login cookie: ", font=('Arial', 12), width=30, height=2)
        self.label_passwd.pack()
        """establish label_passwd Associated input"""
        self.entry_passwd = tk.Entry(self.window, show=None, font=('Arial', 14))
        self.entry_passwd.pack(after=self.label_passwd)

        """establish Text The rich text box is used to display the button operation results"""
        self.text1 = tk.Text(self.window, font=('Arial', 12), width=85, height=22)
        self.text1.pack()

        """Define button 1, bind trigger event method"""

        self.button_1 = tk.Button(self.window, text='Crawling', font=('Arial', 12), width=10, height=1,
                                  command=self.parse_hit_click_1)
        self.button_1.pack(before=self.text1)

        """Define button 2 and bind the trigger event method"""
        self.button_2 = tk.Button(self.window, text='eliminate', font=('Arial', 12), width=10, height=1,
                                  command=self.parse_hit_click_2)
        self.button_2.pack(anchor="e")



    def parse_hit_click_1(self):
        """Define trigger event 1,call main function"""
        user_url = self.entry_user.get()
        pass_wd = self.entry_passwd.get()
        self.main(user_url, pass_wd)


    def main(self, user_url, pass_wd):
        i = 1

        headers_1 = {
            'cookie': pass_wd,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'

        }
        headers_2 ={
            'cookie': pass_wd,
            'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Mobile Safari/537.36'
        }
        uid_1 = re.findall('/(.*?)#', user_url)[0]
        uid_2 = uid_1.split('/', 3)[3]
        # print(uid_2)

        url_1 = f'https://weibo.com/ajax/statuses/show?id={uid_2}'
        response = session.get(url_1, headers=headers_1).content.decode()
        # print(response)
        weibo_id = re.findall('"id":(.*?),"idstr"', response)[0]
        # print(weibo_id)
        # Construction start address
        start_url = f'https://m.weibo.cn/comments/hotflow?id={weibo_id}&mid={weibo_id}&max_id_type=0'
        """
                2.Send request and get response: the beginning of parsing url address
                :return:
                """
        
        response = session.get(start_url, headers=headers_2).json()
        
        """Extract flipped max_id"""
        max_id = response['data']['max_id']
        """Extract flipped max_id_type"""
        max_id_type = response['data']['max_id_type']
        
        """structure GET Request parameters"""
        data = {
            'id': weibo_id,
            'mid': weibo_id,
            'max_id': max_id,
            'max_id_type': max_id_type
        }
        """Analyze comments"""
        self.parse_response_data(response, i)
        """Parameter passing, method callback"""
        self.parse_page_func(data, weibo_id, headers_2, i)

    def parse_page_func(self, data, weibo_id, headers_2, i):
        """
        :return:
        """

        start_url = 'https://m.weibo.cn/comments/hotflow?'
        response = session.get(start_url, headers=headers_2, params=data).json()
        """Extract flipped max_id"""
        max_id = response['data']['max_id']
        """Extract flipped max_id_type"""
        max_id_type = response['data']['max_id_type']
        """structure GET Request parameters"""
        data = {
            'id': weibo_id,
            'mid': weibo_id,
            'max_id': max_id,
            'max_id_type': max_id_type
        }
        """Analyze comments"""
        self.parse_response_data(response, i)
        """Recursive callback"""
        self.parse_page_func(data, weibo_id, headers_2, i)

    def parse_response_data(self, response, i):
        """
        Extract comments from responses
        :return:
        """
        """Extract a large list of comments"""
        data_list = response['data']['data']
        # print(data_list)
        for data_json_dict in data_list:
            # Extract comments
            try:
                texts_1 = data_json_dict['text']
                """need sub Replace label content"""
                # The content to be replaced, the content after replacement, and the replacement object
                alts = ''.join(re.findall(r'alt=(.*?) ', texts_1))
                texts = re.sub("<span.*?</span>", alts, texts_1)
                # Praise quantity
                like_counts = str(data_json_dict['like_count'])
                # Comment time: Greenwich mean time --- it needs to be converted into Beijing time
                created_at = data_json_dict['created_at']
                std_transfer = '%a %b %d %H:%M:%S %z %Y'
                std_create_times = str(datetime.strptime(created_at, std_transfer))
                # Gender is extracted by f
                gender = data_json_dict['user']['gender']
                genders = 'female' if gender == 'f' else 'male'
                # user name
                screen_names = data_json_dict['user']['screen_name']

                print(screen_names, genders, std_create_times, texts, like_counts)
                print()
            except Exception as e:
                continue
        print('*******************************************************************************************')
        print()
        print(f'*****The first{i}Page comments print complete*****')
        i+=1


    def parse_hit_click_2(self):
        """Define trigger event 2 and delete the content in the text box"""
        self.entry_user.delete(0, "end")
        self.entry_passwd.delete(0, "end")
        self.text1.delete("1.0", "end")

    def center(self):
        """Method of creating window centering function"""
        ws = self.window.winfo_screenwidth()
        hs = self.window.winfo_screenheight()
        x = int((ws / 2) - (800 / 2))
        y = int((hs / 2) - (600 / 2))
        self.window.geometry('{}x{}+{}+{}'.format(800, 600, x, y))

    def run_loop(self):
        """Prohibit modification of form size specification"""
        self.window.resizable(False, False)
        """Center window"""
        self.center()
        """Window maintenance--Persistence"""
        self.window.mainloop()



if __name__ == '__main__':
    w = WBSpider()
    w.run_loop()


The code is not perfect. You are welcome to give me directions, break through and turn the page

I wish you all success in learning python!

Tags: Python crawler regex time

Posted by wilhud on Mon, 31 Jan 2022 14:13:05 +1030