Python Coroutines: A Minimal Introduction to asyncio, with a Crawler in Practice

Having covered multi-threading and multi-processing in Python concurrent programming, let's take a look at asynchronous IO programming based on asyncio coroutines.

01

Introduction to Coroutines

A coroutine, also known as a micro-thread or fiber, is neither a process nor a thread; its execution resembles an ordinary Python function call. In the asynchronous IO framework implemented by Python's asyncio module, a coroutine is defined with the async keyword and created by calling the async function.

A process contains multiple threads, much as a tissue is made up of many cells working together. Likewise, a program can contain multiple coroutines. Threads are relatively independent, and switching between them is controlled by the operating system; coroutines are also relatively independent, but switching between them is controlled by the program itself.

02

A Simple Example

Let's use a simple example to understand coroutines. First, look at the following code:

import time

def display(num):
    time.sleep(1)
    print(num)

for num in range(10):
    display(num)

It is easy to understand: the program outputs the numbers 0 through 9, one per second, so the whole program takes about 10 seconds to run. Note that because there is no multi-threading or multi-processing (no concurrency), the program has only one execution unit (a single thread is executing), and time.sleep(1) stalls that entire thread for one second.

During that time the CPU is idle, doing nothing.

Let's take a look at what happens with coroutines:

import asyncio

async def display(num):  # the async keyword turns this into an asynchronous function
    await asyncio.sleep(1)
    print(num)

Asynchronous functions differ from ordinary functions: calling an ordinary function gives you its return value, while calling an asynchronous function gives you a coroutine object. We need to put the coroutine object into an event loop so it can cooperate with other coroutine objects, because the event loop is responsible for switching between subroutines.

Simply put, a blocked subroutine gives up the CPU to one that is ready to run.
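To see the coroutine version actually run concurrently, here is a minimal sketch using asyncio.gather and asyncio.run (available from Python 3.7; the rest of this article uses the older event-loop API introduced below):

import asyncio

async def display(num):
    await asyncio.sleep(1)  # non-blocking sleep: control returns to the event loop
    print(num)

async def main():
    # all ten coroutines run concurrently, so the total runtime is about 1 second, not 10
    await asyncio.gather(*(display(num) for num in range(10)))

asyncio.run(main())

Because every coroutine waits cooperatively instead of blocking the thread, the event loop overlaps the ten sleeps.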

03

Basic Concepts

Asynchronous IO means that after the program initiates an IO operation (which would otherwise block while waiting), it can carry on with other work instead of waiting for the IO to finish; when the IO operation completes, the program is notified and continues execution from there. Asynchronous IO programming is one way to achieve concurrency and is well suited to IO-intensive tasks.

Python's asyncio module provides a framework for asynchronous programming (the original post includes a flow chart of the overall workflow). The following describes each concept at the code level:

async: defines a function that, when called later, is not executed immediately but instead returns a coroutine object;

async def test():
    print('hello asynchronous')

test()  # call the async function

output: RuntimeWarning: coroutine 'test' was never awaited

coroutine: the coroutine object; you can add it to the event loop, and the event loop will call it;

async def test():
    print('hello asynchronous')

c = test()  # calling the asynchronous function returns the coroutine object --> c
print(c)

output: <coroutine object test at 0x0000023FD05AA360>

event_loop: the event loop, in effect an endless loop; you can register functions on it, and a registered function is not executed immediately but is called by the loop when the right conditions are met;

async def test():
    print('hello asynchronous')

c = test()  # calling the asynchronous function returns the coroutine object --> c
loop = asyncio.get_event_loop()  # create an event loop
loop.run_until_complete(c)  # hand the coroutine object to the loop, which executes the async function body

output: hello asynchronous

await: used to suspend execution at a blocking operation and yield control until it completes;

import asyncio
def running1():
    async def test1():
        print('1')
        await test2()
        print('2')
    async def test2():
        print('3')
        print('4')
    loop = asyncio.get_event_loop()
    loop.run_until_complete(test1())
if __name__ == '__main__':
    running1()

output:
1
3
4
2

task: a further encapsulation of the coroutine object that also records the task's state;

async def test():
    print('hello asynchronous')

c = test()  # calling the asynchronous function returns the coroutine object --> c
loop = asyncio.get_event_loop()  # create an event loop
task = loop.create_task(c)  # create the task
print(task)
loop.run_until_complete(task)  # run the task

output:
<Task pending coro=<test() running at D:/xxxx.py>>  # the task, still pending
hello asynchronous  # the async function body runs as before

future: represents a task that will (or will not yet) be executed in the future; in essence there is no difference from a task. The original shows no code for it, but a minimal sketch follows;
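For completeness, a minimal sketch of a bare Future, using the same old-style loop API as the rest of this article (the helper set_after and its values are illustrative, not from the original):

import asyncio

async def set_after(fut, delay, value):
    await asyncio.sleep(delay)
    fut.set_result(value)  # completing the future wakes up whatever awaits it

loop = asyncio.get_event_loop()
fut = loop.create_future()  # a bare Future bound to this loop
loop.create_task(set_after(fut, 1, 'hello future'))
print(loop.run_until_complete(fut))  # prints 'hello future' after about 1 second

A task is in fact a subclass of Future that drives a coroutine, which is why the two behave the same from the caller's side.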

First, create a function in the ordinary (synchronous) way:

def func(url):
    print(f'Making a request to {url}:')
    print(f'Request to {url} succeeded!')

func('www.baidu.com')

The result looks like this:

Making a request to www.baidu.com:
Request to www.baidu.com succeeded!

04

Basic Operations

Create a coroutine object

Define an asynchronous function with the async keyword; calling the asynchronous function returns a coroutine object.

An asynchronous function can suspend itself mid-execution to let other asynchronous functions run, then resume once the suspending condition (for example, asyncio.sleep(n)) has passed. Let's modify the code above:

async def func(url):
    print(f'Making a request to {url}:')
    print(f'Request to {url} succeeded!')

func('www.baidu.com')

The result is as follows:

RuntimeWarning: coroutine 'func' was never awaited

As mentioned earlier, with the async keyword the function call yields a coroutine object. A coroutine cannot be run directly; it must be added to an event loop, which will call it at the appropriate time;

Create a task object

The task object is a further encapsulation of the coroutine object;

import asyncio

async def func(url):
    print(f'Making a request to {url}:')
    print(f'Request to {url} succeeded!')

c = func('www.baidu.com')  # calling the function returns the coroutine object --> c
loop = asyncio.get_event_loop()  # create an event loop object
task = loop.create_task(c)
loop.run_until_complete(task)  # register the task and start the loop
print(task)

The result is as follows:

Making a request to www.baidu.com:
Request to www.baidu.com succeeded!
<Task finished coro=<func() done, defined at D:/data_/test.py:10> result=None>

Use of future

As mentioned earlier, there is no essential difference between a future and a task:

async def func(url):
    print(f'Making a request to {url}:')
    print(f'Request to {url} succeeded!')

c = func('www.baidu.com')  # calling the function returns the coroutine object --> c

loop = asyncio.get_event_loop()  # create an event loop object
future_task = asyncio.ensure_future(c)
print(future_task, 'not yet run')
loop.run_until_complete(future_task)  # register the task and start the loop
print(future_task, 'finished')

The result is as follows:

<Task pending coro=<func() running at D:/data/test.py:10>> not yet run
Making a request to www.baidu.com:
Request to www.baidu.com succeeded!
<Task finished coro=<func() done, defined at D:/data/test.py:10> result=None> finished

Use of the await keyword

Inside an asynchronous function, the await keyword suspends time-consuming operations (IO operations such as network requests and file reads). When an asynchronous program reaches a step that involves a long wait, it suspends there and other asynchronous functions get to run in the meantime.

import asyncio, time

async def do_some_work(n):  # define an asynchronous function with the async keyword
    print('waiting: {} seconds'.format(n))
    await asyncio.sleep(n)  # sleep for a while
    return 'finished after {} seconds'.format(n)

start_time = time.time()  # start time
coro = do_some_work(2)
loop = asyncio.get_event_loop()  # create an event loop object
loop.run_until_complete(coro)
print('running time: ', time.time() - start_time)

The running result is as follows:

waiting: 2 seconds
running time:  2.001312017440796

05

Multitasking with Coroutines

The task object encapsulates a coroutine object and preserves its state after it runs; a task is registered on the event loop with the run_until_complete() method;

If we want multitasking, we need to register a list of tasks at once. We can use run_until_complete(asyncio.wait(tasks)), where tasks is a sequence of tasks (usually a list).

Multiple tasks can also be registered with run_until_complete(asyncio.gather(*tasks)); a variant using gather is shown after the example below.

import asyncio, time

async def do_some_work(i, n):  # define an asynchronous function with the async keyword
    print('Task {} waiting: {} seconds'.format(i, n))
    await asyncio.sleep(n)  # sleep for a while
    return 'Task {} finished after {} seconds'.format(i, n)

start_time = time.time()  # start time
tasks = [asyncio.ensure_future(do_some_work(1, 2)),
         asyncio.ensure_future(do_some_work(2, 1)),
         asyncio.ensure_future(do_some_work(3, 3))]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
for task in tasks:
    print('task result: ', task.result())
print('running time: ', time.time() - start_time)

The running result is as follows:

Task 1 waiting: 2 seconds
Task 2 waiting: 1 seconds
Task 3 waiting: 3 seconds
task result:  Task 1 finished after 2 seconds
task result:  Task 2 finished after 1 seconds
task result:  Task 3 finished after 3 seconds
running time:  3.0028676986694336

The total running time is about 3 seconds (the longest single task), not 6, because the three waits overlap.
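As promised above, the same example using asyncio.gather; a minimal variant (gather accepts coroutines directly and hands back their results in submission order):

import asyncio, time

async def do_some_work(i, n):
    print('Task {} waiting: {} seconds'.format(i, n))
    await asyncio.sleep(n)
    return 'Task {} finished after {} seconds'.format(i, n)

start_time = time.time()
loop = asyncio.get_event_loop()
# gather wraps the coroutines in tasks and returns their results as a list
results = loop.run_until_complete(asyncio.gather(do_some_work(1, 2),
                                                 do_some_work(2, 1),
                                                 do_some_work(3, 3)))
for result in results:
    print('task result: ', result)
print('running time: ', time.time() - start_time)

Unlike asyncio.wait, which returns sets of done and pending tasks, gather returns the result values themselves, so no task.result() calls are needed.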

06

In Practice | Crawling LOL Skins

First open the official website:

You can see the list of heroes (not shown in detail here). Each hero has multiple skins; our goal is to crawl all the skins of every hero and save them into a folder per hero;

Open a hero's skin page as follows:

The screenshot shows the skins of the Dark Child; by viewing the Network panel, we find the corresponding skin data in a js file;

We then find the url pattern for hero skin data:

url1 = 'https://game.gtimg.cn/images/lol/act/img/js/hero/1.js' 
url2 = 'https://game.gtimg.cn/images/lol/act/img/js/hero/2.js' 
url3 = 'https://game.gtimg.cn/images/lol/act/img/js/hero/3.js'

Only the id parameter changes, so the pattern is:

'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(i)

But the ids are only sequential at the beginning; for the rest, you have to look up each hero's id on the page that lists all heroes.

A screenshot of the last few heroes' ids appears in the original post; to crawl every hero you would first need to collect all the ids. Since the early ids are sequential, here we crawl the skins of the first 20 heroes;

1. Get the hero skin url addresses:

The first hero ids are sequential, so we can use range(1, 21) to construct the urls dynamically;

def get_page():
    page_urls = []
    for i in range(1, 21):
        url = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(i)
        print(url)
        page_urls.append(url)
    return page_urls

2. Request each page url and parse the response to get the skin image urls:

import requests, json

headers = {'User-Agent': 'Mozilla/5.0'}  # the original defines headers elsewhere; any common browser UA works

def get_img():
    img_urls = []
    page_urls = get_page()
    for page_url in page_urls:
        res = requests.get(page_url, headers=headers)
        result = res.content.decode('utf-8')
        res_dict = json.loads(result)
        skins = res_dict["skins"]

        for hero in skins:
            item = {}
            item['name'] = hero["heroName"]
            item['skin_name'] = hero["name"]
            if hero["mainImg"] == '':
                continue
            item['imgLink'] = hero["mainImg"]
            print(item)
            img_urls.append(item)
    return img_urls

Notes:

  • res_dict = json.loads(result): converts the json string into a dictionary;

  • heroName: the hero's name (identical across all of a hero's skins, so we can later create one folder per hero);

  • name: the full name, including the skin name (unique per skin). Some 'mainImg' values are empty, so we need to check for that;
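To make the parsing concrete, each hero's js file decodes to JSON roughly of this shape (the field names come from the code above; the values are invented for illustration):

res_dict = {
    "skins": [
        {"heroName": "Annie", "name": "Annie", "mainImg": "https://game.gtimg.cn/....jpg"},
        {"heroName": "Annie", "name": "Annie in Wonderland", "mainImg": ""},  # empty mainImg: skipped
    ]
}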

3. Create a coroutine function

Here we create a folder from the hero name; pay attention to how the image file is named, and don't forget the /, so the directory structure is built correctly:

import os, requests

async def save_img(index, img_url):
    path = "skin/" + img_url['name']
    if not os.path.exists(path):
        os.makedirs(path)
    # note: requests.get is a blocking call, so this coroutine never actually yields
    content = requests.get(img_url['imgLink'], headers=headers).content
    with open('./skin/' + img_url['name'] + '/' + img_url['skin_name'] + str(index) + '.jpg', 'wb') as f:
        f.write(content)
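Because requests.get blocks, the downloads above cannot truly overlap. A minimal non-blocking sketch, assuming the third-party aiohttp library (pip install aiohttp, not used in the original post) and the same headers dict:

import os
import aiohttp  # third-party library: an assumption, not part of the original code

async def save_img_async(session, index, img_url):
    path = "skin/" + img_url['name']
    if not os.path.exists(path):
        os.makedirs(path)
    # the request now yields to the event loop while it waits on the network
    async with session.get(img_url['imgLink'], headers=headers) as resp:
        content = await resp.read()
    with open('./skin/' + img_url['name'] + '/' + img_url['skin_name'] + str(index) + '.jpg', 'wb') as f:
        f.write(content)

The caller would open one aiohttp.ClientSession and pass it to every task, since creating a session per request defeats connection reuse.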

Main function:

def main():
    loop = asyncio.get_event_loop()
    img_urls = get_img()
    print(len(img_urls))
    tasks = [save_img(i, img_url) for i, img_url in enumerate(img_urls)]
    try:
        loop.run_until_complete(asyncio.wait(tasks))
    finally:
        loop.close()

4. Running the program

if __name__ == '__main__':
    start = time.time() 
    main() 
    end = time.time() 
    print(end - start)

Running result:

It took about 42 seconds to download 233 images, so the speed is decent. The resulting file directory looks like this:

Comparison with requests

Having crawled the images asynchronously, it is worth crawling the same data synchronously with requests and comparing efficiency. We modify the original code (the details are skipped here since the idea is the same): the event loop is simply replaced with an ordinary loop:

img_urls = get_img()
print(len(img_urls))
for i, img_url in enumerate(img_urls):
    save_img(i, img_url)

We can see that the coroutine version runs somewhat faster than the plain requests version, although since save_img still calls blocking requests.get, the gain is limited; a fully non-blocking client like the aiohttp sketch above would benefit far more.

That's all for this article. Interested readers are encouraged to type out the code themselves~

Tags: Python crawler programming language

Posted by PC Nerd on Mon, 25 Jul 2022 04:02:21 +0930