Using TensorFlow.js in a WeChat Mini Program for front-end face detection

Work completed so far

In the previous chapter, I gave a detailed walkthrough of the BlazeFace API, pointed out the shortcomings of its code and the directions for improvement, and then built a simple demo: pick an image, run face detection on it, and draw the predicted bounding box and key points.

Objectives of this chapter

  1. Based on the shortcomings of the BlazeFace API identified in the previous chapter, improve the API code so that it is better suited to WeChat Mini Program front-end development
  2. Extend the project business: implement real-time face detection with real-time drawing of the prediction box and key points

Improving the BlazeFace API

First, we need to determine the directions for improvement:

  1. Every time the program starts and the API is called, the model weights and topology are pulled from the network. The API should save this data locally after the first download so that later runs can recognize faces offline
  2. Some of the API logic is messy. The image preprocessing can be wrapped into one function and the parsing of prediction results into another, which improves the readability of the code
  3. Some input parameters are meaningless here. For example, there is no point in the face prediction entry accepting HTMLVideoElement, HTMLImageElement, HTMLCanvasElement and similar types, because a WeChat Mini Program cannot obtain them
  4. Move every operation that touches TensorFlow.js into the API itself; the calling code should not perform any tfjs operation at all, which improves decoupling
  5. Crop the input to a square to improve the accuracy of the predicted bounding box and key points
  6. Improve performance: increase prediction speed and reduce memory consumption

OK! Next, let's improve the code item by item.

Preparation for improving the API

First of all, the API source code is written in TypeScript. TypeScript is a superset of JavaScript; the code can only run after tsc compiles the TypeScript into JavaScript. Therefore, we need to add a TypeScript environment to the face-detect project we created earlier.

For an existing JavaScript Mini Program project, there are plenty of online tutorials on adding a TypeScript environment. Because my code is managed on GitHub, you can check the commit history to see exactly what changes I made.

At the same time, to make debugging easier, I made some modifications on top of the takephoto page from the previous chapter (I hard-coded the image path so I don't have to pick an image every time) and named the new page myapi.

With that, the preparation for improving the API is complete.

Offline recognition

For WeChat Mini Programs, TensorFlow.js provides two ways to store the model locally: caching it in localStorage and saving it as a user file; the official tutorial describes both in detail. I use the localStorage cache here. Since the model is loaded in index.ts, the corresponding changes are also made in index.ts. The implementation is simple, so I won't repeat it.
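
To make the pattern concrete, here is a minimal sketch of the load-or-cache idea, written against the generic TensorFlow.js API. The cache key name is made up, and the sketch assumes a tfjs version in which GraphModel.save and the 'localstorage://' scheme are available; in a Mini Program you would substitute the storage handler described in the official tutorial.

import * as tfconv from '@tensorflow/tfjs-converter';

// Hypothetical cache key; in an applet, swap in the storage handler from
// the official tutorial instead of this URL scheme.
const LOCAL_MODEL_URL = 'localstorage://blazeface';

async function loadBlazefaceModel(remoteUrl: string): Promise<tfconv.GraphModel> {
  try {
    // Later runs: load the cached copy so recognition works offline.
    return await tfconv.loadGraphModel(LOCAL_MODEL_URL);
  } catch (e) {
    // First run: pull the weights and topology from the network, then cache them.
    const model = await tfconv.loadGraphModel(remoteUrl, {fromTFHub: true});
    await model.save(LOCAL_MODEL_URL);
    return model;
  }
}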

Modifying and optimizing the decodeBounds function

Let's analyze the following function:

function decodeBounds(
    boxOutputs: tf.Tensor2D, anchors: tf.Tensor2D,
    inputSize: tf.Tensor1D): tf.Tensor2D {
  const boxStarts = tf.slice(boxOutputs, [0, 1], [-1, 2]);
  const centers = tf.add(boxStarts, anchors);
  const boxSizes = tf.slice(boxOutputs, [0, 3], [-1, 2]);

  const boxSizesNormalized = tf.div(boxSizes, inputSize);
  const centersNormalized = tf.div(centers, inputSize);

  const halfBoxSize = tf.div(boxSizesNormalized, 2);
  const starts = tf.sub(centersNormalized, halfBoxSize);
  const ends = tf.add(centersNormalized, halfBoxSize);

  const startNormalized = tf.mul(starts, inputSize);
  const endNormalized = tf.mul(ends, inputSize);

  const concatAxis = 1;
  return tf.concat2d(
      [startNormalized as tf.Tensor2D, endNormalized as tf.Tensor2D],
      concatAxis);
}

boxOutputs and anchors are two-dimensional tensors, which are essentially two-dimensional arrays, i.e. tables. The table size is 896 × 17, which means there are 896 anchor points in total, and each anchor point carries 17 prediction attributes, of which only the first five are used here. Note that the predicted center coordinates of the prediction box are expressed relative to each anchor point: boxStarts slices out the predicted center offsets, boxSizes slices out the predicted box width and height, and centers is the center coordinate of the prediction box with its origin at the top-left corner of the picture. The function then normalizes these values, computes the coordinates of the top-left and bottom-right corners of the prediction box, de-normalizes them back to the original value range, and finally concatenates the two along axis 1 as the output. If you want to verify this yourself, I recommend breakpoint debugging, so that the input and output of every line of code can be seen at a glance.


The values operated on here are not large (no more than 128) and there are only a few operations, so the normalization and de-normalization steps are unnecessary. Dividing by inputSize and later multiplying by the same inputSize cancel each other out exactly, so removing both leaves the result unchanged while reducing the amount of code, improving readability, and speeding the program up slightly.

The deletion result is as follows:

function decodeBounds(
    boxOutputs: tf.Tensor2D, anchors: tf.Tensor2D): tf.Tensor2D {
  // Columns 1-2: predicted center offsets relative to each anchor point.
  const boxStarts = tf.slice(boxOutputs, [0, 1], [-1, 2]);
  const centers = tf.add(boxStarts, anchors);
  // Columns 3-4: predicted box width and height.
  const boxSizes = tf.slice(boxOutputs, [0, 3], [-1, 2]);

  // Convert center/size into top-left and bottom-right corners.
  const halfBoxSize = tf.div(boxSizes, 2);
  const starts = tf.sub(centers, halfBoxSize);
  const ends = tf.add(centers, halfBoxSize);

  // Concatenate along the columns: each row becomes [x1, y1, x2, y2].
  const concatAxis = 1;
  return tf.concat2d(
      [starts as tf.Tensor2D, ends as tf.Tensor2D],
      concatAxis);
}

Encapsulating the image preprocessing function

 async preprocess(image:FrameData){
    return tf.tidy(()=>{
      // Turn uint8 data into tensor data
      const tensor3dImage = tf.browser.fromPixels(image);
      // Crop the square from the middle of the picture
      const squareImage = this.makeSquare(tensor3dImage)
      // Reduce the picture to 128 * 128
      const resizedImage = tf.image.resizeBilinear(squareImage,
        [this.width, this.height]);
      // Add a batch dimension to meet the model's input requirements
      const tensor4dImage = tf.expandDims(resizedImage, 0)
      // Normalize the picture from [0, 255] to [-1, 1]
      // int[0,255] -cast-> float[0,255] -div-> float[0,2] -sub-> float[-1,1]
      const normalizedImage = tf.sub(tf.div(tf.cast(tensor4dImage, 'float32'), 127.5), 1);
      return normalizedImage
    })
  }

makeSquare is a newly added function whose purpose is to crop the image to a square. If you are interested, you can check the source code yourself; I won't go into it here.
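
For reference, a rough sketch of what such a central-square crop might look like is given below. This is an assumed implementation, not the author's actual code; the offsetX, offsetY and scaleFactor fields follow the names used later in postprocess, and this.width is the 128-pixel model input size.

makeSquare(image: tf.Tensor3D): tf.Tensor3D {
  // fromPixels produces a [height, width, channels] tensor.
  const [height, width] = image.shape;
  const side = Math.min(width, height);
  // Remember how much was cropped so predictions can be mapped back later.
  this.offsetX = Math.floor((width - side) / 2);
  this.offsetY = Math.floor((height - side) / 2);
  // Each pixel in 128x128 model space covers side / 128 original pixels.
  this.scaleFactor = side / this.width;
  return tf.slice(image, [this.offsetY, this.offsetX, 0], [side, side, 3]);
}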

Why crop the picture to a square? In short, there are three reasons:

  1. Image distortion. The model requires a square input; if you resize a non-square image directly, the image is stretched (distorted), which hurts prediction.
  2. The prediction box is affected by the aspect ratio of the original picture. The prior boxes (the preset width and height of the prediction boxes) are square, but the predicted boxes do not end up square, because the x and y coordinates are scaled by different ratios when the boxes are mapped back to the original image. This means that input pictures with different aspect ratios produce prediction boxes with different aspect ratios.
  3. For mobile development this becomes a fatal problem, because we cannot know in advance what aspect ratio the camera will output, which means we cannot predict the aspect ratio of the prediction box either.

So it is necessary to cut the image into squares.

Encapsulating the prediction result parsing function

 async postprocess(res: tf.Tensor3D){
   // Obtain the decoded prediction frame and corresponding confidence
   const [outputs, boxes, scores] =tf.tidy(()=>{
     const prediction = tf.squeeze(res); 
     const decodedBounds = decodeBounds(prediction as tf.Tensor2D, this.anchors);
     const logits = tf.slice(prediction as tf.Tensor2D, [0, 0], [-1, 1]);
     const scores = tf.squeeze(tf.sigmoid(logits));
     return [prediction as tf.Tensor2D, decodedBounds, scores as tf.Tensor1D];
   })

   // Non-maximum suppression. The async call returns a promise, so it cannot live inside tidy
   const indicesTensor = await tf.image.nonMaxSuppressionAsync(
     boxes, scores, this.maxFaces, this.iouThreshold, this.scoreThreshold);
   const indices = indicesTensor.arraySync()

   // According to the suppression results, the effective prediction frame, key points and confidence are intercepted
   const [topLefts, bottomRights, score, landmarks] = tf.tidy(()=>{
     const suppressBox: tf.Tensor2D= tf.gather(boxes,indicesTensor)
     const topLefts = tf.slice(suppressBox,[0,0],[-1,2])
     .mul(this.scaleFactor).add(tf.tensor1d([this.offsetX,this.offsetY]))
     const bottomRights = tf.slice(suppressBox,[0,2],[-1,2])
     .mul(this.scaleFactor).add(tf.tensor1d([this.offsetX,this.offsetY]))
     const suppressScore = tf.gather(scores,indicesTensor)
     const suppressOutput = tf.gather(outputs, indicesTensor)
     const landmarks = tf.slice(suppressOutput,[0,5],[-1,-1])
     .reshape([-1,NUM_LANDMARKS,2])
     return [topLefts.arraySync(),bottomRights.arraySync(),suppressScore.arraySync(),landmarks.arraySync()]
   })
   
   // Delete useless tensors to prevent memory leakage
   outputs.dispose()
   boxes.dispose()
   scores.dispose()

   // Decode the key points and encapsulate them into a normalized face array
   const normalizedFaces:NormalizedFace[] = []
   for(let i in indices){
     const normalizedLandmark = (landmarks[i]).map((landmark:[number,number])=>([
       (landmark[0]+this.anchorsData[indices[i]][0])*this.scaleFactor+this.offsetX,
       (landmark[1]+this.anchorsData[indices[i]][1])*this.scaleFactor+this.offsetY
     ]))
     const normalizedFace = {
       topLeft: topLefts[i],
       bottomRight: bottomRights[i],
       landmarks: normalizedLandmark,
       probability: score[i]
     }
     normalizedFaces.push(normalizedFace)
   }
   indicesTensor.dispose()
   return normalizedFaces
 }

A new function appears here: tf.gather (the official documentation explains it in detail). It selects elements from the original tensor according to the given indices and assembles them into a new tensor. Here is a comparison between the original code and the new code:

// Original code
boxIndices.map((boxIndex: number) => tf.slice(boxes, [boxIndex, 0], [1, -1]))
// New code
const suppressBox: tf.Tensor2D = tf.gather(boxes, indicesTensor)

As you can see, the tf.gather version is more readable and also more efficient: the original code returns an array in which each element is a separate tensor, while the new code produces only one tensor.
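
If you have not used tf.gather before, a tiny standalone example (tf being the usual TensorFlow.js import) shows the behavior:

// Pick rows 2 and 0 from a 2-D tensor, in that order.
const sample = tf.tensor2d([[1, 1, 2, 2], [3, 3, 4, 4], [5, 5, 6, 6]]);
const rowIndices = tf.tensor1d([2, 0], 'int32');
tf.gather(sample, rowIndices).print();
// Output:
// [[5, 5, 6, 6],
//  [1, 1, 2, 2]]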

The rest of the parsing function is covered by the comments in the code and is not difficult to read. I suggest stepping through it with breakpoints to deepen your understanding.

Deleting unused code

Delete the following code:

  1. box.ts
  2. BlazeFacePrediction
  3. getInputTensorDimensions
  4. flipFaceHorizontal
  5. scaleBoxFromPrediction
  6. getBoundingBoxes

Quite a few lines of code are deleted here, and most of their functionality is now covered by the two new functions: image preprocessing and prediction result parsing. After the deletion the program still runs normally, which confirms that they have indeed been replaced by those two functions.

Input parameter adjustment

Because we are in a WeChat Mini Program, the data we can obtain is limited. In practice, we can only get an ArrayBuffer plus the image width and height from the onCameraFrame callback (alternatively, you can call takePhoto to get a temporary image path, draw the image onto a canvas, and then use wx.canvasGetImageData to get ImageData). We have no way to obtain data such as HTMLImageElement. Therefore, we adjust the input parameters and change the prediction entry function to the following:

async estimateFaces(image: Uint8Array, width: number,
    height: number): Promise<NormalizedFace[]> {}
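
As an illustration of how the new signature is fed, a call from the camera frame listener might look like the sketch below; detector and drawResults are placeholder names for this project's detector instance and drawing routine.

const cameraContext = wx.createCameraContext();
const listener = cameraContext.onCameraFrame((frame) => {
  // frame.data is an ArrayBuffer of RGBA pixel data; wrap it as Uint8Array.
  const image = new Uint8Array(frame.data);
  detector.estimateFaces(image, frame.width, frame.height)
    .then((faces) => drawResults(faces));
});
listener.start();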

Summary of the code improvements

At this point we have finished improving the BlazeFace API source code. It now caches the model, the code is more readable, and the amount of code is cut by nearly half, from more than 400 lines to a little over 200. The running speed also improves somewhat.

Next, let's implement real-time face detection and real-time drawing of the prediction box and key points!

Real-time face detection

Here is the source code; the logic is very simple. The only thing I want to emphasize is that the preview shown by the camera component is obtained by scaling each frame and centering it, as illustrated in the following diagram:

In other words, the camera preview does not display the frame data directly, so after you obtain the prediction box and key point information, you need to apply some corrections before the detection box and key points can be drawn onto the interface. For the specific calculation, see the source code; the coordinate correction is done by the transformPoint function.
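
To give an idea of what that correction involves, here is a sketch of the kind of mapping transformPoint has to perform; this is an assumed implementation based on the scale-and-center behavior described above, and the real source may differ.

function transformPoint(
    point: [number, number],
    frameWidth: number, frameHeight: number,
    viewWidth: number, viewHeight: number): [number, number] {
  // The preview scales the frame up until it covers the whole view
  // ("aspect fill"), so take the larger of the two scale ratios.
  const scale = Math.max(viewWidth / frameWidth, viewHeight / frameHeight);
  // The scaled frame is centered, so the overflow is cropped evenly.
  const offsetX = (viewWidth - frameWidth * scale) / 2;
  const offsetY = (viewHeight - frameHeight * scale) / 2;
  return [point[0] * scale + offsetX, point[1] * scale + offsetY];
}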

Summary

In this chapter we mainly modified the BlazeFace API so that it better fits WeChat Mini Program development, and then implemented the real-time face detection business. With that, the front-end face detection work is basically complete. Next we can polish the UI and then start developing the back end; after all, the Mini Program only does the front-end work, and without back-end support, complex business cannot be completed.

Tags: Deep Learning Machine Learning Mini Program TensorFlow

Posted by tylrwb on Thu, 14 Apr 2022 14:08:02 +0930