Paddle Inference: server-side model deployment with PaddlePaddle

Contents

1. Introduction

1.1 Comparison with direct model inference

1.2 Flow chart

1.3 High performance

1.4 Multi-function integration

2. Inference deployment example -- Python

2.1 Model acquisition

2.1.1 PaddlePaddle framework

2.1.2 TensorFlow, PyTorch, Caffe

2.2 Using Paddle Inference

2.3 Complete runnable code

1. Introduction

Paddle Inference is the native inference library of PaddlePaddle. It targets server and cloud deployment and provides high-performance inference capability.

Because it is built directly on PaddlePaddle's training operators, Paddle Inference generally supports all models trained with PaddlePaddle.

Paddle Inference offers rich functionality and excellent performance. It is deeply adapted and optimized for different application scenarios on different platforms to achieve high throughput and low latency, so that PaddlePaddle models can go from training to rapid deployment on the server side.

1.1 Comparison with direct model inference

A trained model can simply be switched to eval mode and used for inference directly, so why use an inference library at all?

Compared with running the model directly in inference mode, Paddle Inference can use MKL-DNN (oneDNN), cuDNN, and TensorRT for prediction acceleration. It also supports models converted from third-party frameworks (TensorFlow, PyTorch, Caffe, etc.) with the X2Paddle tool, and can work with PaddleSlim to load and deploy quantized, pruned, and distilled models.

Running a model in eval mode is suitable for direct prediction with a trained model, while Paddle Inference is suitable for users with requirements on inference performance and generality. It is deeply adapted and optimized for different application scenarios on different platforms, so that models are ready for rapid server-side deployment as soon as training finishes.

1.2 Flow chart

1.3 High performance

  • Memory / GPU memory reuse improves service throughput

    • During the inference initialization stage, the dependencies of the OP output tensors in the model are analyzed, and tensors that do not depend on each other reuse the same memory / GPU memory space. This increases the degree of parallel computation and improves service throughput.
  • Fine-grained horizontal and vertical OP fusion reduces computation

    • During the inference initialization stage, multiple OPs in the model are fused into a single OP according to existing fusion patterns, which both reduces the amount of computation and reduces the number of kernel launches, thereby improving inference performance. Paddle Inference currently supports dozens of fusion patterns.
  • Built-in high-performance CPU/GPU kernels

    • Built-in high-performance kernels, developed jointly with Intel and NVIDIA, ensure high-performance execution of model inference.

1.4 Multi-function integration

  • TensorRT integration to speed up GPU inference

    • Paddle Inference integrates TensorRT in the form of subgraphs. For GPU inference scenarios, TensorRT can optimize some subgraphs, including horizontal and vertical OP fusion, filtering out redundant OPs, and automatically selecting the optimal kernels for OPs to speed up inference (see the configuration sketch after this list).
  • Integrated oneDNN CPU inference acceleration engine

    • A single line of code turns on oneDNN acceleration, fast and efficient (also shown in the sketch after this list).
  • Support for deploying PaddleSlim quantized and compressed models

    • PaddleSlim is a compression tool for PaddlePaddle deep learning models. Paddle Inference can work with PaddleSlim to load and deploy quantized, pruned, and distilled models, reducing model storage size and computation memory while speeding up inference. For quantized models, Paddle Inference has been deeply optimized on X86 CPUs: the single-thread performance of common classification models improves by nearly 3x, and the single-thread performance of the ERNIE model improves by 2.68x.
  • Support for models converted with X2Paddle

    • In addition to models trained with PaddlePaddle, it also supports models converted from third-party frameworks (such as TensorFlow, PyTorch, or Caffe) with the X2Paddle tool.
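
Below is a minimal configuration sketch of how these acceleration options are typically switched on through the Config object. It assumes a Paddle build with the corresponding GPU/TensorRT or oneDNN support; the model paths and parameter values are illustrative, not tuned settings.

import paddle.inference as paddle_infer

config = paddle_infer.Config("resnet50/inference.pdmodel",
                             "resnet50/inference.pdiparams")

use_gpu = False  # pick one of the two paths below

if use_gpu:
    # GPU path: enable the GPU and hand eligible subgraphs to TensorRT
    config.enable_use_gpu(100, 0)  # 100 MB initial memory pool, GPU device 0
    config.enable_tensorrt_engine(
        workspace_size=1 << 30,       # scratch memory TensorRT may use
        max_batch_size=1,
        min_subgraph_size=3,          # only offload subgraphs with at least 3 OPs
        precision_mode=paddle_infer.PrecisionType.Float32,
        use_static=False,
        use_calib_mode=False)
else:
    # CPU path: the single line that turns on the oneDNN (MKL-DNN) engine
    config.enable_mkldnn()

predictor = paddle_infer.create_predictor(config)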

2. Inference deployment example -- Python

2.1 Model acquisition

2.1.1 PaddlePaddle framework

PaddlePaddle supports development with both dynamic graphs and static graphs. A trained dynamic-graph model can be saved as a static-graph model in two steps, and the static graph is better suited for inference deployment.

Step 1: load the model and parameters with paddle.load()

# model and optim are the network and optimizer objects built during training
model_state_dict = paddle.load('lenet.pdparams')
opt_state_dict = paddle.load('lenet.pdopt')
model.set_state_dict(model_state_dict)
optim.set_state_dict(opt_state_dict)

Step 2: convert the model to a static graph with paddle.jit.to_static and save it with paddle.jit.save

from paddle.jit import to_static
from paddle.static import InputSpec
net = to_static(model, input_spec=[InputSpec(shape=[None, 1, 28, 28], name='x')])
paddle.jit.save(net, 'inference_model/lenet')
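
Putting the two steps together, here is a minimal end-to-end sketch of the export flow. It assumes the built-in LeNet from paddle.vision.models and the checkpoint file names used above; adapt the network construction to your own model.

import paddle
from paddle.jit import to_static
from paddle.static import InputSpec
from paddle.vision.models import LeNet

# Rebuild the network and optimizer, then restore the trained weights
model = LeNet()
optim = paddle.optimizer.Adam(parameters=model.parameters())
model.set_state_dict(paddle.load('lenet.pdparams'))
optim.set_state_dict(paddle.load('lenet.pdopt'))

# Convert the dynamic-graph model to a static graph and save it for inference;
# this writes the *.pdmodel and *.pdiparams files under the given path prefix
model.eval()
net = to_static(model, input_spec=[InputSpec(shape=[None, 1, 28, 28], dtype='float32', name='x')])
paddle.jit.save(net, 'inference_model/lenet')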

2.1.2 TensorFlow, PyTorch, Caffe

The X2Paddle tool can be used to convert models from third-party frameworks (TensorFlow, PyTorch, Caffe, etc.) into models supported by Paddle Inference.
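
As a rough illustration of the conversion, the command-line invocations below follow the X2Paddle documentation and should be checked against the installed version; the file names are placeholders.

# Convert an ONNX model (PyTorch models can first be exported to ONNX)
x2paddle --framework=onnx --model=model.onnx --save_dir=pd_model

# Convert a Caffe model
x2paddle --framework=caffe --prototxt=deploy.prototxt --weight=deploy.caffemodel --save_dir=pd_model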

2.2 Using Paddle Inference

The basic usage steps are as follows:

Load the inference model and configure prediction

First, load the inference model, configure some prediction options, and create a prediction engine according to the configuration:

from paddle.inference import Config, create_predictor

config = Config("inference_model/lenet/lenet.pdmodel", "inference_model/lenet/lenet.pdiparams") # Load the model via the model file and parameter file paths
config.disable_gpu() # Use the CPU for prediction
predictor = create_predictor(config) # Create the prediction engine (predictor) from the config
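
If a GPU build of PaddlePaddle is available, the disable_gpu() line above can be replaced by enabling the GPU instead; the memory-pool size and device id below are illustrative values.

config.enable_use_gpu(100, 0)  # initialize a 100 MB GPU memory pool on GPU device 0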

Set input

First obtain the names of the input tensors, and then get the handle of the input tensor by name.

# Get input variable name
input_names = predictor.get_input_names()
input_handle = predictor.get_input_handle(input_names[0])

Next, prepare the input data and copy it to the device used for prediction. Random data is used here; in practice it should be replaced with the real images to be predicted.

# Set input
fake_input = np.random.randn(1, 1, 28, 28).astype("float32")
input_handle.reshape([1, 1, 28, 28])
input_handle.copy_from_cpu(fake_input)

Run prediction

predictor.run()

Get output

# Get output variable name
output_names = predictor.get_output_names()
output_handle = predictor.get_output_handle(output_names[0])
output_data = output_handle.copy_to_cpu()
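
The result is an ordinary NumPy array, so it can be post-processed directly. For a classifier like LeNet, a small sketch (assuming the output is a batch of class scores and numpy is imported as np):

# Pick the most likely class for each sample in the batch
pred_label = np.argmax(output_data, axis=1)
print("Predicted labels:", pred_label)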

2.3 Complete runnable code

Prepare the model: ResNet50, which can be downloaded from the link below. It is a PaddlePaddle static-graph model and can be used directly.

wget https://paddle-inference-dist.bj.bcebos.com/Paddle-Inference-Demo/resnet50.tgz
tar zxf resnet50.tgz

# After extraction, the model directory contains the following files
resnet50/
├── inference.pdmodel
├── inference.pdiparams.info
└── inference.pdiparams

Save the following code as python_demo.py:

This program combines the code shown in the Paddle Inference usage section above with the argparse library for command-line argument parsing.

import argparse
import numpy as np

# Import the Paddle Inference prediction library
import paddle.inference as paddle_infer

def main():
    args = parse_args()

    # Create config
    config = paddle_infer.Config(args.model_file, args.params_file)

    # Create predictor according to config
    predictor = paddle_infer.create_predictor(config)

    # Get the input tensor names and the handle of the first input
    input_names = predictor.get_input_names()
    input_handle = predictor.get_input_handle(input_names[0])

    # Set input
    fake_input = np.random.randn(args.batch_size, 3, 318, 318).astype("float32")
    input_handle.reshape([args.batch_size, 3, 318, 318])
    input_handle.copy_from_cpu(fake_input)

    # Run predictor
    predictor.run()

    # Get output
    output_names = predictor.get_output_names()
    output_handle = predictor.get_output_handle(output_names[0])
    output_data = output_handle.copy_to_cpu() # numpy.ndarray type
    print("Output data size is {}".format(output_data.size))
    print("Output data shape is {}".format(output_data.shape))

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_file", type=str, help="model filename")
    parser.add_argument("--params_file", type=str, help="parameter filename")
    parser.add_argument("--batch_size", type=int, default=1, help="batch size")
    return parser.parse_args()

if __name__ == "__main__":
    main()

Run the program:

# model_file and params_file point to the ResNet50 model downloaded earlier in this section
python python_demo.py --model_file ./resnet50/inference.pdmodel --params_file ./resnet50/inference.pdiparams --batch_size 2

Result: the program prints the size and shape of the output tensor. For ResNet50 with batch_size 2, the output is expected to have shape (2, 1000) and size 2000.
