Having covered multi-threading and multi-processing in Python concurrent programming, let's now look at asynchronous IO programming based on asyncio coroutines.
01
Introduction to Coroutines
A coroutine, also known as a micro-thread or fiber, is neither a process nor a thread; its execution looks much like an ordinary Python function call. In the asynchronous IO framework provided by Python's asyncio module, a coroutine is defined with the async keyword, and calling the async function produces the coroutine;
A process contains multiple threads, much as an organism contains many cells working together; likewise, a program can contain multiple coroutines. Multiple threads are relatively independent, and switching between them is controlled by the operating system. Similarly, multiple coroutines are relatively independent, but switching between them is controlled by the program itself.
02
A simple example
Let's use a simple example to understand coroutines. First, look at the following code:
import time

def display(num):
    time.sleep(1)
    print(num)

for num in range(10):
    display(num)
It is easy to understand: the program outputs the numbers 0 through 9, one per second, so the whole run takes about 10 seconds. Note that because there is no multi-threading or multi-processing (no concurrency), the program has only one execution unit (a single thread), and time.sleep(1) stalls that thread for a full second. During that second the CPU sits idle, doing nothing.
Let's take a look at what happens with coroutines:
import asyncio

async def display(num):  # the async keyword turns this into an asynchronous function
    await asyncio.sleep(1)
    print(num)
Asynchronous functions differ from ordinary functions: calling an ordinary function returns its return value, while calling an asynchronous function returns a coroutine object. We need to hand the coroutine object to an event loop so it can cooperate with other coroutine objects, because the event loop is responsible for switching between coroutines. Simply put, a blocked coroutine yields the CPU to a runnable one.
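To see the payoff, here is a minimal sketch of the concurrent version of the earlier loop (scaled down to 0.1-second sleeps so it finishes quickly; the gather call and the timing code are additions for illustration). Because every coroutine yields at await, all ten finish in roughly one sleep interval rather than ten:

```python
import asyncio
import time

async def display(num):
    await asyncio.sleep(0.1)  # yields control instead of blocking the thread
    print(num)

async def main():
    # schedule all ten coroutines concurrently on one event loop
    await asyncio.gather(*(display(num) for num in range(10)))

start = time.time()
asyncio.run(main())
elapsed = time.time() - start
print(f'elapsed: {elapsed:.2f}s')  # roughly one sleep interval, not ten
```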
03
Basic concepts
Asynchronous IO means that after the program initiates an IO operation (which would otherwise block), it can carry on with other work instead of waiting for the IO to finish; when the IO operation completes, the program is notified and continues from there. Asynchronous IO programming is one way to achieve concurrency and is well suited to IO-intensive tasks.
The Python module asyncio provides an asynchronous programming framework. The overall flow chart is roughly as follows:
The following describes each building block at the code level.
async: defines a function that will not execute immediately when called; the call instead returns a coroutine object;
async def test():
    print('hello asynchronous')

test()  # call the async function directly

output: RuntimeWarning: coroutine 'test' was never awaited
coroutine: the coroutine object; it can be added to the event loop, which will then call it;
async def test():
    print('hello asynchronous')

c = test()  # calling the asynchronous function returns the coroutine object c
print(c)

output: <coroutine object test at 0x0000023FD05AA360>
event_loop: the event loop, effectively an infinite loop; you can register functions on it, and they are not executed immediately but are called by the loop when certain conditions are met;
async def test():
    print('hello asynchronous')

c = test()                       # calling the asynchronous function returns the coroutine object c
loop = asyncio.get_event_loop()  # create an event loop
loop.run_until_complete(c)       # hand the coroutine to the loop, which runs the async function body

output: hello asynchronous
await: suspends the current coroutine at a blocking operation;
import asyncio

def running1():
    async def test1():
        print('1')
        await test2()
        print('2')

    async def test2():
        print('3')
        print('4')

    loop = asyncio.get_event_loop()
    loop.run_until_complete(test1())

if __name__ == '__main__':
    running1()
output:
1
3
4
2
task: a task, which is a further encapsulation of a coroutine object and records the task's state;
async def test():
    print('hello asynchronous')

c = test()                       # calling the asynchronous function returns the coroutine object c
loop = asyncio.get_event_loop()  # create an event loop
task = loop.create_task(c)       # create the task
print(task)
loop.run_until_complete(task)    # run the task

output:
<Task pending coro=<test() running at D: /xxxx.py>>  # the task
hello asynchronous  # the async function body runs just as before
future: represents a task that will (or will not) be executed in the future; in practice there is no essential difference between a future and a task, so no code is shown for it in the original article;
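For completeness, here is a minimal sketch of a bare Future: unlike a task, it wraps no coroutine and must be fulfilled by hand with set_result. The helper name set_after, the 0.1-second delay, and the asyncio.run style are illustrative choices, not from the original article:

```python
import asyncio

async def set_after(fut, value):
    await asyncio.sleep(0.1)  # simulate some work
    fut.set_result(value)     # fulfil the future by hand

async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()  # a bare Future wrapping no coroutine
    loop.create_task(set_after(fut, 'hello future'))
    return await fut            # suspends until set_result is called

result = asyncio.run(main())
print(result)
```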
First, create an ordinary (synchronous) function:
def func(url):
    print(f'making a request to {url}:')
    print(f'request to {url} succeeded!')

func('www.baidu.com')
The result looks like this:
making a request to www.baidu.com:
request to www.baidu.com succeeded!
04
Basic operation
Create a coroutine object
Define an asynchronous function with the async keyword, and call the asynchronous function to return a coroutine object.
An asynchronous function can suspend in mid-execution to let other asynchronous functions run, and it resumes once the condition it is waiting on (for example asyncio.sleep(n)) has completed. Let's modify the code above:
async def func(url):
    print(f'making a request to {url}:')
    print(f'request to {url} succeeded!')

func('www.baidu.com')
The result is as follows:
RuntimeWarning: coroutine 'func' was never awaited
As mentioned earlier, with the async keyword the function call returns a coroutine object. A coroutine cannot run by itself; it must be added to an event loop, which will call it at the appropriate time;
Create a task object
A task object is a further encapsulation of a coroutine object;
import asyncio

async def func(url):
    print(f'making a request to {url}:')
    print(f'request to {url} succeeded!')

c = func('www.baidu.com')        # the function call yields the coroutine object c
loop = asyncio.get_event_loop()  # create an event loop object
task = loop.create_task(c)
loop.run_until_complete(task)    # register and start
print(task)
The result is as follows:
making a request to www.baidu.com:
request to www.baidu.com succeeded!
<Task finished coro=<func() done, defined at D:/data_/test.py:10> result=None>
Use of future
Earlier we mentioned that there is no essential difference between a future and a task:
async def func(url):
    print(f'making a request to {url}:')
    print(f'request to {url} succeeded!')

c = func('www.baidu.com')              # the function call yields the coroutine object c
loop = asyncio.get_event_loop()        # create an event loop object
future_task = asyncio.ensure_future(c)
print(future_task, 'not yet run')
loop.run_until_complete(future_task)   # register and start
print(future_task, 'finished')
The result is as follows:
<Task pending coro=<func() running at D:/data/test.py:10>> not yet run
making a request to www.baidu.com:
request to www.baidu.com succeeded!
<Task finished coro=<func() done, defined at D:/data/test.py:10> result=None> finished
Use of await keyword
Inside an asynchronous function, the await keyword suspends the coroutine at time-consuming operations (IO operations such as network requests or file reads). When the program reaches a step that involves a long wait, it suspends there and executes other asynchronous functions in the meantime.
import asyncio, time

async def do_some_work(n):  # define an asynchronous function with the async keyword
    print('waiting: {} seconds'.format(n))
    await asyncio.sleep(n)  # sleep for a while
    return 'returned after {} seconds'.format(n)

start_time = time.time()         # start time
coro = do_some_work(2)
loop = asyncio.get_event_loop()  # create an event loop object
loop.run_until_complete(coro)
print('running time: ', time.time() - start_time)
The running result is as follows:
waiting: 2 seconds
running time:  2.001312017440796
05
Multitasking with coroutines
A task object wraps a coroutine object and records its state after running, and the run_until_complete() method registers a task with the event loop.
To run multiple tasks, we register several at once with run_until_complete(asyncio.wait(tasks)), where tasks is a sequence of tasks (usually a list).
Multiple tasks can also be registered with run_until_complete(asyncio.gather(*tasks)).
import asyncio, time

async def do_some_work(i, n):  # define an asynchronous function with the async keyword
    print('task {} waiting: {} seconds'.format(i, n))
    await asyncio.sleep(n)  # sleep for a while
    return 'task {} returned after {} seconds'.format(i, n)

start_time = time.time()  # start time
tasks = [asyncio.ensure_future(do_some_work(1, 2)),
         asyncio.ensure_future(do_some_work(2, 1)),
         asyncio.ensure_future(do_some_work(3, 3))]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

for task in tasks:
    print('task execution result: ', task.result())

print('running time: ', time.time() - start_time)
The running result is as follows:
task 1 waiting: 2 seconds
task 2 waiting: 1 seconds
task 3 waiting: 3 seconds
task execution result:  task 1 returned after 2 seconds
task execution result:  task 2 returned after 1 seconds
task execution result:  task 3 returned after 3 seconds
running time:  3.0028676986694336
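The same three tasks can also be registered with asyncio.gather, as mentioned above. This sketch scales the sleeps down to 0.2/0.1/0.3 seconds so it finishes quickly (an illustrative change); unlike wait, gather returns the results directly, in the order the coroutines were passed in:

```python
import asyncio, time

async def do_some_work(i, n):
    await asyncio.sleep(n)  # sleep for a while
    return 'task {} returned after {} seconds'.format(i, n)

async def main():
    # gather takes coroutines (or tasks) and returns their results in order
    return await asyncio.gather(do_some_work(1, 0.2),
                                do_some_work(2, 0.1),
                                do_some_work(3, 0.3))

start_time = time.time()
results = asyncio.run(main())
elapsed = time.time() - start_time
print(results)
print('running time: ', elapsed)  # about the longest task, not the sum
```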
06
Hands-on | Crawling LOL skins
First open the official website:
You can see the list of heroes, which will not be shown in detail here. Each hero has multiple skins, and our goal is to crawl all the skins of every hero and save them into the corresponding folders;
Open a hero's skin page as follows:
Here is the page for Annie, the Dark Child; the thumbnails below the portrait correspond to her various skins. By inspecting the network panel, we find the corresponding skin data in a js file;
Then we found the url link rules for hero skins:
url1 = 'https://game.gtimg.cn/images/lol/act/img/js/hero/1.js'
url2 = 'https://game.gtimg.cn/images/lol/act/img/js/hero/2.js'
url3 = 'https://game.gtimg.cn/images/lol/act/img/js/hero/3.js'
We found that only the id parameter is dynamically constructed, and the rule is:
'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(i)
However, only the ids near the front follow this simple sequence; the id of any given hero can be found on the page listing all heroes.
The ids of the last few heroes fall out of sequence, so to crawl everything you would need to collect the ids first. Since the early ones are in order, here we crawl the skins of the first 20 heroes;
1. Get the hero skin url addresses:
The first hero ids are sequential, so range(1, 21) can be used to construct the urls dynamically;
def get_page():
    page_urls = []
    for i in range(1, 21):
        url = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(i)
        print(url)
        page_urls.append(url)
    return page_urls
2. Request the url of each page
and parse the response to get the url of each skin image:
def get_img():
    img_urls = []
    page_urls = get_page()
    for page_url in page_urls:
        res = requests.get(page_url, headers=headers)
        result = res.content.decode('utf-8')
        res_dict = json.loads(result)
        skins = res_dict["skins"]
        for hero in skins:
            item = {}
            item['name'] = hero["heroName"]
            item['skin_name'] = hero["name"]
            if hero["mainImg"] == '':
                continue
            item['imgLink'] = hero["mainImg"]
            print(item)
            img_urls.append(item)
    return img_urls
Notes:
- res_dict = json.loads(result): converts the json-formatted string into a dictionary;
- heroName: the hero's name (the same for all of a hero's skins, so we can later create one folder per hero);
- name: the full name, including the skin name (different for every skin). Some entries have an empty 'mainImg', which we need to check for.
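To make the filtering concrete, here is a small self-contained sketch of the same parsing logic run against hypothetical sample data (the field names match the description above, but the hero, skin names, and urls are invented):

```python
import json

# invented sample mimicking the structure of the hero js files
sample = '''{"skins": [
  {"heroName": "Annie", "name": "Annie the Dark Child", "mainImg": "https://example.com/skin1.jpg"},
  {"heroName": "Annie", "name": "Annie in Wonderland", "mainImg": ""}
]}'''

res_dict = json.loads(sample)  # json string -> dictionary
img_urls = []
for hero in res_dict["skins"]:
    item = {'name': hero["heroName"], 'skin_name': hero["name"]}
    if hero["mainImg"] == '':  # entries without a main image are skipped
        continue
    item['imgLink'] = hero["mainImg"]
    img_urls.append(item)
print(len(img_urls))  # only the entry with a mainImg survives
```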
3. Create a coroutine function
Here we create a folder per hero name; pay attention to how each picture is named, and don't forget the /, so the directory structure is built correctly:
async def save_img(index, img_url):
    path = "skin/" + img_url['name']
    if not os.path.exists(path):
        os.makedirs(path)
    content = requests.get(img_url['imgLink'], headers=headers).content
    with open('./skin/' + img_url['name'] + '/' + img_url['skin_name'] + str(index) + '.jpg', 'wb') as f:
        f.write(content)
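One caveat: requests.get is a blocking call, so while it runs, the whole event loop is stalled and no other coroutine can make progress. A common stdlib workaround is to push the blocking call into a thread pool with run_in_executor; the sketch below uses a hypothetical blocking_download stand-in instead of a real network request:

```python
import asyncio, time

def blocking_download(url):
    # hypothetical stand-in for requests.get(url).content
    time.sleep(0.2)
    return b'bytes-for-' + url.encode()

async def save_img(index, url):
    loop = asyncio.get_running_loop()
    # the blocking call runs in a worker thread; the event loop stays free
    content = await loop.run_in_executor(None, blocking_download, url)
    return index, content

async def main():
    urls = ['a.jpg', 'b.jpg', 'c.jpg']
    return await asyncio.gather(*(save_img(i, u) for i, u in enumerate(urls)))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
print(len(results), round(elapsed, 1))  # the downloads overlap instead of running back to back
```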
Main function:
def main():
    loop = asyncio.get_event_loop()
    img_urls = get_img()
    print(len(img_urls))
    tasks = [save_img(index, img_url) for index, img_url in enumerate(img_urls)]
    try:
        loop.run_until_complete(asyncio.wait(tasks))
    finally:
        loop.close()
4. Program running
if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)
operation result:
It took 42s to download 233 pictures, which is decent speed. The resulting file directory looks like this:
Compare with requests
Having crawled the pictures asynchronously, it is worth crawling the same data synchronously with requests to compare efficiency. We modify the original code; since the idea is the same, the details are skipped here: the event loop is simply replaced with an ordinary loop:
img_urls = get_img()
print(len(img_urls))
for i, img_url in enumerate(img_urls):
    save_img(i, img_url)
We can see that using coroutines is somewhat faster than plain synchronous requests.
The above is the whole content of this article. Interested readers can type the code by themselves~