Web crawler: grab the IDs and link addresses of NetEase Cloud Music popular songs


Primary goal: capture the IDs and link addresses of the 50 most popular songs of each popular singer on NetEase Cloud Music.
Ultimate goal:
1. Capture song IDs through the popular singers.
2. Capture the IDs of commenting users through the song IDs.
3. Send targeted push messages to those commenting users.
To store the captured results, we use MySQL. This keeps each step separate, with the steps exchanging data through MySQL.

Learning content:

Master writing a crawler program in Python, and connecting to and operating a MySQL database with PyMySQL.
For example:
1. Build a MySQL database
2. Master the basic syntax for connecting Python to MySQL
3. Master using the crawler to obtain the specified information
4. Master piecing together the URL for each link

Step 1: create the table in MySQL

First, set up a MySQL environment and create a database named python, then create a table in this database to store the captured song ID, song name, and the corresponding web page address.

Create table statement:

create database python;
ALTER DATABASE python CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

CREATE TABLE `songinf`  (
  `song_id` varchar(30) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `song_name` varchar(1000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `song_url` varchar(150) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `clbz` varchar(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  `height` float(3, 2) NULL DEFAULT 0.00,
  INDEX `song_id`(`song_id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci ROW_FORMAT = Dynamic;

Step 2: operate MySQL through PyMySQL

After the table is created, we test whether we can insert a row into the newly created songinf table through PyMySQL. If the row can be inserted, the connection works.
Create a Python file named wangyiyunSpiderSQL.py; step 3 imports this file.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql
# from ,where, group by, select, having, order by, limit
class Mysql_pq(object):
    def __init__(self):
        # Fill in your own host, user and password here.
        # Use utf8mb4 as the charset when connecting, matching the database,
        # because it can also store non-BMP characters such as emoji.
        self.conn = pymysql.Connect(host='',
                                    user='',
                                    password='',
                                    database='python',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def modify_sql(self, sql, data):
        # Execute a parameterized statement and commit the transaction
        self.cursor.execute(sql, data)
        self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def insert_songinf(song_id, song_name, song_url, clbz):
    helper = Mysql_pq()
    print('Connected to the database python, ready to insert song information')
    # insert data
    insert_sql = 'insert into songinf(song_id,song_name,song_url,clbz) values (%s,%s,%s,%s)'
    data = (song_id, song_name, song_url, clbz)
    helper.modify_sql(insert_sql, data)

if __name__ == '__main__':
    song_id = '1313052960'
    song_name = 'Please remember me'
    song_url = 'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1'
    clbz = 'N'
    insert_songinf(song_id, song_name, song_url, clbz)
    print('test over')

Run this program, and then use DbVisualizer to check whether the songinf table now contains the row.
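Testing insert_songinf requires a running MySQL server. If you just want to sanity-check the parameterized-insert pattern itself, the same logic can be sketched against an in-memory SQLite database (a stand-in I am using here instead of MySQL; note that SQLite's placeholder is `?` where PyMySQL's is `%s`):

```python
import sqlite3

# In-memory stand-in for the MySQL `songinf` table
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE songinf (song_id TEXT, song_name TEXT, song_url TEXT, clbz TEXT)')

# Same shape as insert_songinf, with SQLite-style placeholders
insert_sql = 'insert into songinf(song_id,song_name,song_url,clbz) values (?,?,?,?)'
data = ('1313052960', 'Please remember me',
        'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1', 'N')
cursor.execute(insert_sql, data)
conn.commit()

# Verify the row landed
cursor.execute('select song_id, clbz from songinf')
print(cursor.fetchone())  # ('1313052960', 'N')
```

Passing the values as a tuple instead of formatting them into the SQL string is what protects against SQL injection, in both SQLite and PyMySQL.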

Step 3: get song information through crawler

Core operation: visit the singer's home page through the singer ID, obtain the top 50 popular songs displayed on that page, extract each song's name and ID, and piece together the song's page URL at the same time.

Every singer's home page has the same format; only the ID differs:

url = 'https://music.163.com/artist?id=' + artist_id

Therefore, as long as we can get a singer's ID, we can get their home page address.
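As a small sketch of this assembly (the helper name `artist_page_url` and the digits-only check are mine; the URL pattern is the one shown above):

```python
def artist_page_url(artist_id: str) -> str:
    """Build a NetEase Cloud Music artist home page URL from an artist ID."""
    if not artist_id.isdigit():
        raise ValueError('artist_id should be a string of digits: ' + artist_id)
    return 'https://music.163.com/artist?id=' + artist_id

print(artist_page_url('1197168'))  # https://music.163.com/artist?id=1197168
```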
The singer IDs could also be obtained by a crawler, but the NetEase Cloud Music pages are built with iframes, which makes them hard to scrape; debugging the crawler that fetches user IDs already cost two days over Chinese New Year. To avoid wasting more time here, I collected the IDs of the singers I wanted by hand, simply by visiting each singer's home page. For example:

Xu Binglong 1197168 Zhou Bichang 10558 also used coupons to buy Pepper 12085562 brilliance Yu 861777 Lin Yujia 3685
 Li Ronghao 4292 Aska Yang 6066 Joker Xue 5781 Tanya Chua 7214 Jin Wen Qi 893259 JJ Lin 3684

Therefore, the new python program is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

# Netease cloud music generates the singer's popular songs and song home page through the singer ID
import requests
from lxml import etree
from wangyiyunSpiderSQL import *

headers = {
    'Referer': 'http://music.163.com',
    'Host': 'music.163.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Chrome/10'
}

# Get the song ID and song name of the top 50 popular songs on the specified singer page
def get_songs(artist_id):
    page_url = 'https://music.163.com/artist?id=' + artist_id
    # Get web page HTML
    res = requests.request('GET', page_url, headers=headers)
    # Parsing the top 50 popular songs with XPath
    html = etree.HTML(res.text)
    href_xpath = "//*[@id='hotsong-list']//a/@href"
    name_xpath = "//*[@id='hotsong-list']//a/text()"
    hrefs = html.xpath(href_xpath)
    names = html.xpath(name_xpath)
    # Collect the IDs and names of the popular songs
    song_ids = []
    song_names = []
    for href, name in zip(hrefs, names):
        print(href, '  ', name)
        # each href has the form '/song?id=1313052960'
        song_ids.append(href.split('id=')[-1])
        song_names.append(name)
    return song_ids, song_names

# Set the singer ID to 12138269
#Xu Binglong 1197168 Zhou Bichang 10558 also used coupons to buy Pepper 12085562 brilliance Yu 861777 Lin Yujia 3685
# Li Ronghao 4292 Aska Yang 6066 Joker Xue 5781 Tanya Chua 7214 Jin Wen Qi 893259 JJ Lin 3684
# Deng Ziqi 7763 Sun Yanzi 9272 Liang Jingru 8325 Zhang Huimei 10559 Lin Yilian 8336 Mo Wenwei 8926
# Zhao Lei 6731 song Dongye 5073 Ma Yu 4592 Pu Shu 4721 escape plan 12977 Huang Xiaoyun
# 14077324 Eason Chan 2116 AI Chen 12174057 Feng Mingju 12172529 Ruan Dou 12172496 black cat
# 12383659Fine Orchestra 1160085 Guo Ding 2843 Zhou Xingzhe 980025 Tian Fuzhen 9548 may day
# 13193 soda green 12707 Wang Leehom 5346 Tao zhe 5196 Zhou Jielun 6452 Zhou Huajian 6456

artist_id_list = ['12138269','1197168','10558','12085562','861777','3685','4292']

for artist_id in artist_id_list:

    song_ids, song_names = get_songs(artist_id)
    print('len(song_ids) = ',len(song_ids))

    for (song_id, song_name) in zip(song_ids, song_names):
        # Lyrics API URL
        song_url = 'https://music.163.com/#/song?id=' + song_id + '&lv=-1&kv=-1&tv=-1'
        print('song_url = ',song_url)
        print('song_id = ', song_id)
        print('song_name =', song_name)
        clbz = 'N'
        insert_songinf(song_id, song_name, song_url, clbz)
    print('Insert song id end')
    #lyric = get_song_lyric(headers, song_url)
print('Insert song id All successful, end')
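The slicing in get_songs assumes every href looks exactly like `/song?id=1313052960`. A slightly more defensive sketch of the same extraction, using a regular expression (the helper name `extract_song_id` is mine):

```python
import re

def extract_song_id(href: str) -> str:
    """Pull the numeric song ID out of an href such as '/song?id=1313052960'."""
    match = re.search(r'id=(\d+)', href)
    if match is None:
        raise ValueError('no song ID found in ' + href)
    return match.group(1)

print(extract_song_id('/song?id=1313052960'))  # 1313052960
```

This also works on the full song page URLs, since the first `id=` parameter is the song ID.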

Run the program to get the IDs and link addresses of 2613 songs.
Next, we will use these song IDs to get the IDs of the users who commented on them.

Tags: MySQL crawler Data Analysis

Posted by yogibear333 on Mon, 18 Apr 2022 09:07:04 +0930