Unexpected to see a video on a station, it is very interesting to practice.This article is only written as a personal learning note. If there are any deficiencies, please correct them.
Say nothing but the source code and the effect map (read this article if it helps you):
Page Links:https://music.163.com/#/playlist?id=5087806619
Design sketch:
Source code:
# -*- codeing=utf-8 -*- # @Time:2021/7/22 20:47 # @Atuhor:@lwtyh # @File: Bulk Download.py # @Software:PyCharm #Import Framework (Library, Module) pip install xxxx import requests from lxml import etree # http://music.163.com/song/media/outer/url?id= # 1. Determine the real address of the web address on the network--Doc url = 'https://music.163.com/playlist?id=5087806619' base_url = 'http://music.163.com/song/media/outer/url?id=' # 2. requests Picture, Video, Audio content String text html_str = requests.get(url).text # print(type(html_str)) # String type # 3. Filter Data XPath (Label Language) # //a[contains(@href,'/song?')]/@href result = etree.HTML(html_str) # Conversion Type # print(type(result)) song_ids = result.xpath('//a[contains(@href,"/song?")]/@href') #song id song_names = result.xpath('//a[contains(@href,"/song?")]/text()') #Song Name # print(song_ids) # print(song_names) #list # Unzip List i = 0 # In order for song_id,song_name in zip(song_ids,song_names): # print(song_id) # print(song_name) count_id = song_id.strip('/song?id=') # Remove/song?id= # print(count_id) # Filter contains'$'symbols if ('$' in count_id) == False: # print(count_id) song_url = base_url + count_id # Stitching url # print(song_url) i += 1 mp3 = requests.get(song_url).content # 4. Save Data with open('./yinyue/{}.{}.mp3'.format(i,song_name),'wb') as file: file.write(mp3)
Purpose:
For a screenshot, please analyze it yourself:
This is a familiar picture. There are many ways to get these music on this page, such as downloading something on your own within APP, but this time I want to use a little fur to download it.
As we all know, when downloading music on a web page, the following pages often pop up:
Download a song well, don't make it so difficult.Even after downloading the software, some music needs to be paid or VIP, which is very distressing.Even worse, it's not easy to download, but I find that the format is not equal, which makes people crash.
For this reason, we can solve these problems well with simple crawlers.
Analyze Web pages:
1. At the beginning, I gave a link to this web page:https://music.163.com/#/playlist?id=5087806619 But careful little buddies will find that this is not the address they use in the code:
url = 'https://music.163.com/playlist?id=5087806619'
This is because the web address we requested is not the one on the browser address bar. From this screenshot, it is clear which web address we requested is. (This is an important point, you must learn to analyze.)
2. By opening each song, it is not difficult to find that the 10 songs on this page share a common feature:https://music.163.com/#/song?id=1475436266
The front address is https://music.163.com/#/song?id= It's easy to add the ID number of each song.
3. Nevertheless, it's always too easy to think about. So far, we've only found a few rules, but it's too far to really download each song.
Because we've been analyzing for so long, we can't find the real link to the song.
By looking for these contents, we can say that music files (MPEG, MP3, MPEG-4, MIDI, WMA, M4A, etc.) can not be found at all.
That's why we haven't asked for music so far, and when we click and play, we find it as shown in the following image (compared with the one above):
- The number of requests for this web page increased from 167 to 192, which proves that when we played music, the web page re-requested the web page.
- Second, by rediscovering the discovery (just look at what's new later), you have some.m4a files this time.
- When you click on these files to open, you will find a new Request URL: When you copy the web address to open under a new web page, the following image will appear (when opened, the browser will automatically download the music):
Or: As shown in the image below, the audio appears and when your browser is bound to the Thunder Downloader, the Thunder interface pops up immediately to download the music.
In combination with the above, have we succeeded?But I'm sorry to tell you that this web address will be useful if you open it in a short time, but it has a time limit. Believe it, you can reopen it in five minutes (maybe not that long) and try again.
So, can't I say that?Of course not, there is still some way. Otherwise, how dare to "wanton" here?
Problem solving:
Through previous analysis of the web page, we were step by step familiar with, and in the end, we even found the final URL of the song, but unfortunately, the URL is not permanent, it is only a short, temporary dynamic URL, which simply pours cold water on us.
However, we need not be discouraged. As the saying goes, "One foot tall, one foot tall", there is still a way.
To solve this problem, you have to introduce a new URL:
http://music.163.com/song/media/outer/url?id=
Here, you won't sell off. It's an external link to the platform you know.In the previous analysis, we found an important point, that is, the 10 songs are a new page with a web address and the id of each song.
You will also find the base_used in the codeThe URL is the link.
base_url = 'http://music.163.com/song/media/outer/url?id='
That is, if we have this web address above, we can do whatever we want.Now you can copy the link immediately, find the ID number of a song on the web page, add it to the back of the web address and open it (for example:http://music.163.com/song/media/outer/url?id=1822734959) Did you get the following interface?
Is it familiar?Yes, this is the web page we opened with the web address we analyzed before. Unfortunately, the web address was a temporary and dynamic web address, which was not very useful for us to download in bulk.So, when we have this new web address now, it's a lot easier.
Well, think about it. Now that we have such a magical web address, what's next?Think about it.
Positive Start:
After the analysis of the first two points, we can now easily crawl through these ten pieces of music.
Believe, many people know what to do next?
Each piece of music can be passed through http://music.163.com/song/media/outer/url?id= This web address is downloaded with the ID of each music, so the first step is to find a way to get the ID of each music.
It's not difficult to see from the previous picture that each music's id is in an a tag.
#Import Framework (Library, Module) pip install xxxx import requests # 1. Determine the real address of the web address on the network--Doc url = 'https://music.163.com/playlist?id=5087806619' # 2. requests Picture, Video, Audio content String text html_str = requests.get(url).text print(html_str) print(type(html_str)) # String type
You can use the above code to remove the source code of the web page before you analyze it.
If you use a data type here that adds an extra line to print the page code, it is not difficult to see that the type printed is a string.There is a subsequent need to convert this content to _Element object.
>>>result = etree.HTML(html_str) # Conversion Type >>>print(type(result)) class 'lxml.etree._Element'> #type of output
As_Element object, you can easily use getparent(), remove(), xpath() and other methods.
The exact method used by this crawl is the xpath() method.
Therefore, you need to import a new module:
from lxml import etree
The XPath Helper plug-in of the browser allows you to quickly match the id of each piece of music.
song_ids = result.xpath('//a[contains(@href,"/song?")]/@href') #song id song_names = result.xpath('//a[contains(@href,"/song?")]/text()') #Song Name print(song_ids) print(song_names) #list
When we print it out, we find it is a list type.Don't worry, you can borrow for a quick traversal:
for song_id,song_name in zip(song_ids,song_names): print(song_id) print(song_name)
Printing reveals that there is a little more / song?id= in front of it, and then the following line of code is used to delete it:
count_id = song_id.strip('/song?id=') # Remove/song?id=
The song name was not printed because we mainly get the id of each piece of music. However, when we look carefully at the picture above, we find that there are three more useless links behind it. These three links must be deleted. Otherwise, linking URL s behind them will surely result in an error, because such a web address cannot be found at all.Then you have the following sentence.
# Filter contains'$'symbols if ('$' in count_id) == False: print(count_id)
Unicolor id number:
The next step is to stitch the new URL:
song_url = base_url + count_id # Stitching url print(song_url)
Open any link above in your browser to get a link to the music and download it.
But our ultimate goal is definitely not to have crawlers automatically download and save all of them to folders.
mp3 = requests.get(song_url).content
So we request the web address to get each piece of music.Finally, save it.
# 4. Save Data with open('./yinyue/{}.{}.mp3'.format(i,song_name),'wb') as file: file.write(mp3)
It is important to note that when I traverse for in the source code, I add a variable i for the purpose of whether the music we crawled is stored in folders in the same order as it is in Web pages, and of course it can be deleted if it is not needed.
So far, we have crawled 10 pieces of music from this song list, so imagine if you can get music from other songs in the same way?Do it yourself, go and try it!True knowledge comes from practice!
In the last words:
- Knowledge has no limit.Use the way of blogging to review what you have learned, to enhance your impression and learning.
- At the same time, every note written out is for your reference, and you are welcome to give your advice and exchange learning with each other.
- If infringement occurs, the contact is deleted!