A TensorFlow computation graph is composed of ops and tensors, so what does a tensor actually represent? The model's input data, the network weights, and the results produced when ops process that input are all expressed as tensors or as special kinds of tensors. Because tensors are so central to the TensorFlow architecture, this article covers three topics, from basic to advanced: tensors from the user's point of view, tensors inside the TensorFlow system, and an advanced use of tensors, DLPack, for cross-framework programming (for example, TensorFlow + PyTorch).

Note: This article is written based on TensorFlow v1.15.5.

## 1. Tensor in a beginner's eyes

### 1.1 Tensor HelloWorld

Define two tensors and add them. The code is as follows:

```python
# segment 1
import tensorflow as tf

a = tf.constant(3.0, dtype=tf.float32)
b = tf.constant(4.0)  # also tf.float32 implicitly
total = a + b
print(a)
print(b)
print(total)

# The three print calls output:
"""
Tensor("Const:0", shape=(), dtype=float32)
Tensor("Const_1:0", shape=(), dtype=float32)
Tensor("add:0", shape=(), dtype=float32)
"""
# Explanation: at this point a Tensor does not hold a real result yet. The code
# above only builds a computation graph, and each Tensor merely represents the
# output of an op that has not run yet.
```

If you want to see the final value of total, create a Session object and run the computation graph. The code is as follows (appended to segment 1):

```python
with tf.Session() as sess:
    result = sess.run(total)
    print(result, type(result), type(total))
# output = 7.0 <class 'numpy.float32'> <class 'tensorflow.python.framework.ops.Tensor'>
```

As you can see, a Tensor represents a result that has not been computed yet. Creating a Session object and running the computation graph yields the value 7.0 for total, and the returned result is a numpy.float32 value rather than a Tensor. Finally, note that the Tensor printed by the code in this section is tf.Tensor, which is implemented by tensorflow.python.framework.ops.Tensor.
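The idea that a Tensor is only a handle to a result that has not been computed yet can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not TensorFlow's implementation: `ToyTensor`, `constant`, and `run` are hypothetical names invented for this illustration, with `run` playing the role of `Session.run`.

```python
# Toy model of deferred execution; NOT TensorFlow's implementation.
class ToyTensor:
    """Symbolic handle for the output of an op that has not run yet."""

    def __init__(self, name, compute):
        self.name = name          # e.g. "add:0"
        self._compute = compute   # zero-argument callable, invoked only by run()

    def __add__(self, other):
        # Building the graph: record an "add" op, do not compute anything.
        return ToyTensor("add:0", lambda: run(self) + run(other))

    def __repr__(self):
        return 'ToyTensor("%s")' % self.name


def constant(value, name="Const:0"):
    """Record a constant op whose output is `value`."""
    return ToyTensor(name, lambda: value)


def run(tensor):
    """Play the role of Session.run: evaluate the op behind `tensor`."""
    return tensor._compute()


a = constant(3.0)
b = constant(4.0, name="Const_1:0")
total = a + b

print(total)       # a symbolic handle, no value yet
print(run(total))  # the value is produced only when the graph is run
```

As in segment 1, printing `total` shows a handle, and the number 7.0 appears only after the explicit run step.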

### 1.2 Tensor properties and special tensors

From the user's perspective, tf.Tensor has three main attributes: name, dtype, and shape. There are three further important attributes that are less commonly used or not directly visible: op, graph, and device. The op attribute records the operation that produces this Tensor, the graph attribute records the computation graph that contains this Tensor, and the device attribute records the name of the device on which this Tensor is produced.

There are four special tensors in the TensorFlow system (the Tensor is not strictly distinguished here from the op that generates it). According to the TensorFlow tensors guide [2], they are tf.Variable, tf.constant, tf.placeholder, and tf.SparseTensor.

### 1.3 The relationship between Tensor and op

We have mentioned many times that Tensor can be used as the input of the op, and a new Tensor is generated as the output after a series of processing by the op. In order to understand this in depth, let's go back and re-examine the code snippet in segment1 (please pay attention to the naming of Tensor):

```python
# segment 1
a = tf.constant(3.0, dtype=tf.float32)
b = tf.constant(4.0)  # also tf.float32 implicitly
total = a + b
print(a)
print(b)
print(total)

# The three print calls output:
"""
Tensor("Const:0", shape=(), dtype=float32)
Tensor("Const_1:0", shape=(), dtype=float32)
Tensor("add:0", shape=(), dtype=float32)
"""
# Explanation: at this point a Tensor does not hold a real result yet. The code
# above only builds a computation graph, and each Tensor merely represents the
# output of an op that has not run yet.
```

For the code above, let's first identify which objects are Tensors and which are ops, and then describe how each operation is carried out. To answer the first question, consider this note from the official TensorFlow documentation:

""" `tf.constant` creates a `Const` node in the computation graph with the exact value at graph construction time. """

It can be seen that segment 1 contains two kinds of op, Const and add; the former appears twice and the latter once. Segment 1 therefore adds three ops to the computation graph in turn. This also answers the second question, namely how each operation is carried out. The details are as follows:

```python
# The three print calls output (a, b, total):
"""
Tensor("Const:0", shape=(), dtype=float32)
Tensor("Const_1:0", shape=(), dtype=float32)
Tensor("add:0", shape=(), dtype=float32)
"""
# 1. Add the first op (Const) to the computation graph. Its input is a scalar and
#    its output is Tensor a. The tensor name has two parts, "<op name>:<index>",
#    where the index is the position of a among the outputs of the op.
# 2. Add a second op (Const_1, because op names must be unique) to the computation
#    graph. Its input is a scalar and its output is Tensor b, named by the same rule.
# 3. Add a third op (add) to the computation graph. Its inputs are Tensors a and b
#    and its output is Tensor total, named by the same rule.
```
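The naming rules described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not TensorFlow's actual naming code: `unique_op_name`, `tensor_name`, and `parse_tensor_name` are hypothetical helpers, and the `_1`, `_2` suffix scheme is inferred from the printed names `Const` and `Const_1`.

```python
# Toy sketch of the naming rules; the helper names are hypothetical.
def unique_op_name(op_type, used_names):
    """Return op_type, or op_type_1, op_type_2, ... if the name is taken."""
    name = op_type
    suffix = 0
    while name in used_names:
        suffix += 1
        name = "%s_%d" % (op_type, suffix)
    used_names.add(name)
    return name


def tensor_name(op_name, output_index=0):
    """A tensor is named '<op name>:<index of this output of the op>'."""
    return "%s:%d" % (op_name, output_index)


def parse_tensor_name(name):
    """Split 'Const_1:0' into ('Const_1', 0)."""
    op_name, _, index = name.rpartition(":")
    return op_name, int(index)


used = set()
names = [tensor_name(unique_op_name(t, used)) for t in ("Const", "Const", "add")]
print(names)  # ['Const:0', 'Const_1:0', 'add:0']
```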

## 2. Exploring the tensor

### 2.1 Front-end and back-end Tensor mapping

The TensorFlow white paper [7] mentions that the C API is the bridge connecting front-end user code to the back-end execution engine. To understand this in depth, readers are advised to follow the official TensorFlow website and compile the source code from scratch. TensorFlow v1.15.5 is built with Bazel, and the Python front end and the C++ back end interact through SWIG. The SWIG code generation step runs before compilation and parses tensorflow.i to generate two wrapper files automatically: pywrap_tensorflow_internal.py and pywrap_tensorflow_internal.cc. The former is what front-end Python code calls, and the latter calls into the back-end C API. If you install the official TensorFlow binary package, you can see only the .py file, not the .cc file. If you compile TensorFlow from source yourself, you can find both files under bazel-bin in the project root, as shown in the following figure:

The .so file in the red box above is compiled from the .cc file. When the .py module in the yellow box is imported for the first time, the .so dynamic library is loaded automatically. The .cc file behind the .so statically registers a function mapping table that maps Python functions to C functions. The table looks roughly like this:

```cpp
static PyMethodDef SwigMethods[] = {
    { (char *)"SWIG_PyInstanceMethod_New", (PyCFunction)SWIG_PyInstanceMethod_New, METH_O, NULL},
    { (char *)"TF_OK_swigconstant", TF_OK_swigconstant, METH_VARARGS, NULL},
    { (char *)"TF_CANCELLED_swigconstant", TF_CANCELLED_swigconstant, METH_VARARGS, NULL},
    { (char *)"TF_UNKNOWN_swigconstant", TF_UNKNOWN_swigconstant, METH_VARARGS, NULL},
    { (char *)"TF_INVALID_ARGUMENT_swigconstant", TF_INVALID_ARGUMENT_swigconstant, METH_VARARGS, NULL},
    // A lot of code omitted here
};
```

Without hands-on practice, the description above can be hard to follow. To make it easier to understand, the following diagram summarizes it:

Curious readers may say: this is too high-level; I seem to understand it, and yet I don't. That's fine. Next, we take session.run(), the entry point for running a static graph, as an example and trace the front-end to back-end mapping in detail through the TensorFlow source code. The process is shown in the figure below:

The figure above shows clearly that the C API layer separates the front end from the back end. The C API layer includes pywrap_tensorflow_internal.h/cc, tf_session_helper.h/cc, and c_api.h/cc. That completes the mapping of session.run() from the front end to the back end. As for how a front-end tensor is mapped to a back-end tensor, see the following code:

```cpp
// tf_session_helper.cc line351
void TF_SessionRun_wrapper_helper(TF_Session* session, const char* handle,
                                  const TF_Buffer* run_options,
                                  const std::vector<TF_Output>& inputs,
                                  const std::vector<PyObject*>& input_ndarrays,
                                  const std::vector<TF_Output>& outputs,
                                  const std::vector<TF_Operation*>& targets,
                                  TF_Buffer* run_metadata,
                                  TF_Status* out_status,
                                  std::vector<PyObject*>* py_outputs) {
  DCHECK_EQ(inputs.size(), input_ndarrays.size());
  DCHECK(py_outputs != nullptr);
  DCHECK(py_outputs->empty());
  Status s;

  // Convert input ndarray PyObjects to TF_Tensors. We maintain a continuous
  // array of TF_Tensor*s as well as scoped containers to make sure they're
  // cleaned up properly.
  // A lot of code is omitted here. Note that the front-end ndarray objects
  // are converted into TF_Tensors.
}

// c_api.cc line2274
void TF_SessionRun(TF_Session* session, const TF_Buffer* run_options,
                   const TF_Output* inputs, TF_Tensor* const* input_values,
                   int ninputs, const TF_Output* outputs,
                   TF_Tensor** output_values, int noutputs,
                   const TF_Operation* const* target_opers, int ntargets,
                   TF_Buffer* run_metadata, TF_Status* status) {
  // TODO(josh11b,mrry): Change Session to be able to use a Graph*
  // directly, instead of requiring us to serialize to a GraphDef and
  // call Session::Extend().
  if (session->extend_before_run &&
      !ExtendSessionGraphHelper(session, status)) {
    return;
  }

  TF_Run_Setup(noutputs, output_values, status);

  // Convert from TF_Output and TF_Tensor to a string and Tensor.
  // Note: here TensorFlow additionally converts each TF_Tensor into a C++ Tensor.
  std::vector<std::pair<string, Tensor>> input_pairs(ninputs);
  if (!TF_Run_Inputs(input_values, &input_pairs, status)) return;
  for (int i = 0; i < ninputs; ++i) {
    input_pairs[i].first = OutputName(inputs[i]);
  }

  // Convert from TF_Output to string names.
  std::vector<string> output_names(noutputs);
  for (int i = 0; i < noutputs; ++i) {
    output_names[i] = OutputName(outputs[i]);
  }
  // Remaining code omitted.
}
```
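The two conversions in the code above (ndarray to TF_Tensor, then TF_Tensor to C++ Tensor paired with an output name) can be mimicked with a small pure-Python sketch. Everything here is a stand-in chosen for illustration: `CApiTensor`, `BackendTensor`, and the helper functions are hypothetical, and the real code handles many dtypes and shapes rather than a flat list of floats.

```python
import struct
from collections import namedtuple

# Hypothetical stand-ins for two of the layers; field names are illustrative.
CApiTensor = namedtuple("CApiTensor", ["dtype", "dims", "data"])        # ~ TF_Tensor
BackendTensor = namedtuple("BackendTensor", ["dtype", "shape", "buf"])  # ~ C++ Tensor


def to_c_api_tensor(values):
    """Front end -> C API: flatten a list of floats into raw bytes."""
    data = struct.pack("<%df" % len(values), *values)
    return CApiTensor("float32", (len(values),), data)


def to_backend_tensor(c_tensor):
    """C API -> back end: wrap the same bytes without re-encoding them."""
    return BackendTensor(c_tensor.dtype, c_tensor.dims, c_tensor.data)


def build_input_pairs(feeds):
    """Mimic input_pairs: a list of ('<op name>:<index>', tensor) pairs."""
    pairs = []
    for (op_name, index), values in feeds.items():
        tensor = to_backend_tensor(to_c_api_tensor(values))
        pairs.append(("%s:%d" % (op_name, index), tensor))
    return pairs


pairs = build_input_pairs({("Placeholder", 0): [1.0, 2.0]})
print(pairs[0][0])  # Placeholder:0
```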

### 2.2 C++ Tensor class

Looking at reference [5], we find the definition of the C++ Tensor class. Its important fragment (seg1) is as follows:

```cpp
class Tensor {
 public:
  // Tensor serialization/deserialization, detailed in Section 2.3
  bool FromProto(const TensorProto& other) TF_MUST_USE_RESULT;
  void AsProtoField(TensorProto* proto) const;
  void AsProtoTensorContent(TensorProto* proto) const;

  // A Tensor is actually a view of the underlying data, which can be
  // viewed as a vec or a matrix
  template <typename T>
  typename TTypes<T>::Vec vec() { return tensor<T, 1>(); }

  template <typename T>
  typename TTypes<T>::Matrix matrix() { return tensor<T, 2>(); }

  template <typename T, size_t NDIMS>
  typename TTypes<T, NDIMS>::Tensor tensor();

 private:
  TensorShape shape_;  // Maintains the Tensor's shape and data type
  TensorBuffer* buf_;  // Pointer to the underlying data
};
```

Let's first analyze the two private members. Start with the TensorBuffer class: it is an abstract class that inherits from the reference-counting class and contains no implementation of its own. Reference [6] shows that BufferBase inherits from TensorBuffer and maintains a pointer to the memory allocator. The Buffer class inherits from BufferBase and maintains data_, a pointer to the actual data, and elem_, the number of elements. The inheritance relationships among these classes are shown in the figure below (member definitions are included in the figure for ease of understanding, so it is not a standard UML diagram):
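The hierarchy described above can be modelled in plain Python to show how reference counting and the allocator fit together. This is a toy model, not the TensorFlow classes: the class names mirror the ones in the text, but the methods (`ref`, `unref`, `release`) and the use of a Python callable as the allocator are assumptions made for illustration.

```python
# Toy model of the buffer hierarchy; not the TensorFlow classes themselves.
class RefCounted:
    def __init__(self):
        self._refs = 1  # the creator holds the first reference

    def ref(self):
        self._refs += 1

    def unref(self):
        """Drop one reference; return True if the object was released."""
        self._refs -= 1
        if self._refs == 0:
            self.release()
            return True
        return False

    def release(self):
        pass


class TensorBuffer(RefCounted):
    """Abstract interface: subclasses say where the data lives."""

    def data(self):
        raise NotImplementedError


class BufferBase(TensorBuffer):
    """Remembers which allocator produced the memory."""

    def __init__(self, allocator):
        super().__init__()
        self.allocator = allocator


class Buffer(BufferBase):
    """Owns the actual elements: `data_` plus the element count `elem_`."""

    def __init__(self, allocator, n):
        super().__init__(allocator)
        self.elem_ = n
        self.data_ = allocator(n)

    def data(self):
        return self.data_

    def release(self):
        self.data_ = None  # hand the memory back


buf = Buffer(allocator=lambda n: [0.0] * n, n=4)
buf.ref()            # a second tensor now shares the buffer
print(buf.unref())   # False: one owner is still alive
print(buf.unref())   # True: last owner gone, data released
```

Sharing one reference-counted buffer is what lets several tensors act as views of the same data without copying it.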

Next we analyze the TensorShape class. It also has its own class inheritance system, and its core logic is defined in the parent class TensorShapeRep. The related class inheritance system is as follows:

To understand the role of TensorShape in depth, the following analysis draws on part of the TensorShapeRep code (seg2):

```cpp
class TensorShapeRep {
 private:
  // buf below represents a TensorShape in 16 bytes in total. The first 12 bytes
  // store the shape (as Rep16, Rep32, or Rep64).
  // The role of the 13th byte is unclear. The 14th, 15th, and 16th bytes hold,
  // respectively, the data type number, the number of dimensions of the tensor,
  // and the representation type of the dimensions.
  union {
    uint8 buf[16];
    Rep64* unused_aligner;  // Force data to be aligned enough for a pointer.
  } u_;

 public:
  // In theory, tensors of any rank can be defined, but 1-, 2-, and 3-dimensional
  // tensors are the most common. Therefore the following three dimension
  // representations are provided (12 bytes each).
  struct Rep16 {
    uint16 dims_[6];  // Up to 6 dimensions; each dimension is smaller than 2^16-1
  };
  struct Rep32 {
    uint32 dims_[3];  // Up to 3 dimensions; each dimension is smaller than 2^32-1
  };
  struct Rep64 {
    gtl::InlinedVector<int64, 4>* dims_;  // Tensors of any rank are supported
  };
};
```
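To make the three representations concrete, here is a minimal Python sketch of how a shape could be matched to Rep16, Rep32, or Rep64 and packed into the 12 inline bytes. It is an illustration, not TensorFlow's logic: `choose_rep` and `pack_inline` are hypothetical helpers, the bounds are treated as strict upper limits by assumption, and the little-endian layout is chosen only for the example.

```python
import struct

MAX_REP16 = 2**16 - 1  # assumed strict upper bound for a Rep16 dimension
MAX_REP32 = 2**32 - 1  # assumed strict upper bound for a Rep32 dimension


def choose_rep(dims):
    """Pick the most compact representation that can hold `dims`."""
    if len(dims) <= 6 and all(d < MAX_REP16 for d in dims):
        return "Rep16"
    if len(dims) <= 3 and all(d < MAX_REP32 for d in dims):
        return "Rep32"
    return "Rep64"


def pack_inline(dims):
    """Pack Rep16/Rep32 dims into the 12 inline bytes; Rep64 is out of line."""
    rep = choose_rep(dims)
    if rep == "Rep16":
        padded = list(dims) + [0] * (6 - len(dims))
        return struct.pack("<6H", *padded)
    if rep == "Rep32":
        padded = list(dims) + [0] * (3 - len(dims))
        return struct.pack("<3I", *padded)
    return None  # Rep64 stores a pointer to a heap-allocated vector instead


print(choose_rep([224, 224, 3]))    # Rep16
print(choose_rep([100000, 10]))     # Rep32
print(choose_rep([2**40]))          # Rep64
```

A typical image batch shape fits in Rep16, which is why most shapes never need a heap allocation.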

At the end of this section, let's look at vec() and matrix() in the Tensor class definition. Both are implemented by calling the common method tensor(), whose return type is TTypes<T, NDIMS>::Tensor, and TTypes is the key link between the TF Tensor and the Eigen library. See the following code (seg3):

```cpp
// tensorflow1.15.5\tensorflow\core\framework\tensor.h
class Tensor {
 public:
  // Returns the shape of the tensor.
  const TensorShape& shape() const { return shape_; }

  template <typename T>
  typename TTypes<T>::Vec vec() { return tensor<T, 1>(); }

  template <typename T>
  typename TTypes<T>::Matrix matrix() { return tensor<T, 2>(); }

  template <typename T, size_t NDIMS>
  typename TTypes<T, NDIMS>::Tensor tensor();
};

// tensorflow1.15.5\tensorflow\core\framework\tensor_types.h
template <typename T, int NDIMS = 1, typename IndexType = Eigen::DenseIndex>
struct TTypes {
  // Rank-<NDIMS> tensor of scalar type T.
  typedef Eigen::TensorMap<Eigen::Tensor<T, NDIMS, Eigen::RowMajor, IndexType>,
                           Eigen::Aligned>
      Tensor;
  // A lot of code omitted
};

// tensorflow1.15.5\tensorflow\core\framework\tensor.h
// The shape() of TF Tensor returns TensorShape. base() returns a pointer to the actual data.
template <typename T, size_t NDIMS>
typename TTypes<T, NDIMS>::Tensor Tensor::tensor() {
  CheckTypeAndIsAligned(DataTypeToEnum<T>::v());
  return typename TTypes<T, NDIMS>::Tensor(base<T>(),
                                           shape().AsEigenDSizes<NDIMS>());
}
```

The code above shows that calling tensor() converts a TF Tensor into a TTypes<T, NDIMS>::Tensor, which is essentially an Eigen::TensorMap. This clarifies the relationship between the TF Tensor and the Eigen library: the TF C++ Tensor can be regarded as a wrapper around Eigen::TensorMap, because the arguments of the Eigen::TensorMap constructor come from information stored in the TF Tensor (the data pointer returned by base() and the shape returned by shape()).
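The "pointer plus shape gives a view" idea is not specific to Eigen. As a rough analogy only, Python's built-in `memoryview` can reinterpret a flat byte buffer as a row-major 2x3 matrix without copying; the buffer plays the part of base() and the shape tuple plays the part of shape(). The variable names below are chosen for this illustration.

```python
import struct

# A flat buffer of six float32 values, standing in for the tensor's data.
raw = bytearray(struct.pack("6f", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0))

# Reinterpret the same bytes as a 2x3 row-major matrix; nothing is copied.
flat = memoryview(raw)
matrix = flat.cast("f", (2, 3))

print(matrix[1, 0])  # 3.0, the element at row 1, column 0

# Writing through the raw buffer is visible through the view.
struct.pack_into("f", raw, 0, 42.0)
print(matrix[0, 0])  # 42.0
```

Because the view holds no data of its own, creating it is cheap, which matches the way vec() and matrix() hand out typed views of one underlying buffer.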

### 2.3 C++ Tensor serialization

Distributed training in TensorFlow involves a large amount of cross-machine communication, and what is communicated is serialized tensors (via cooperating send/recv op pairs). This section covers the serialization mechanism of Tensor and the conversion between a Tensor and its serialized object. The serialized object that corresponds to a Tensor in TensorFlow is called TensorProto, and it is generated from the corresponding proto file. The code is as follows (seg4):

```protobuf
// tensorflow1.15.5\tensorflow\core\framework\tensor.proto
syntax = "proto3";

message TensorProto {
  DataType dtype = 1;

  TensorShapeProto tensor_shape = 2;

  int32 version_number = 3;

  bytes tensor_content = 4;

  repeated int32 half_val = 13 [packed = true];

  // DT_FLOAT.
  repeated float float_val = 5 [packed = true];

  // DT_DOUBLE.
  repeated double double_val = 6 [packed = true];

  // DT_INT32, DT_INT16, DT_INT8, DT_UINT8.
  repeated int32 int_val = 7 [packed = true];

  // DT_STRING
  repeated bytes string_val = 8;

  // DT_COMPLEX64. scomplex_val(2*i) and scomplex_val(2*i+1) are real
  // and imaginary parts of i-th single precision complex.
  repeated float scomplex_val = 9 [packed = true];

  // DT_INT64
  repeated int64 int64_val = 10 [packed = true];

  // DT_BOOL
  repeated bool bool_val = 11 [packed = true];

  // DT_COMPLEX128. dcomplex_val(2*i) and dcomplex_val(2*i+1) are real
  // and imaginary parts of i-th double precision complex.
  repeated double dcomplex_val = 12 [packed = true];

  // DT_RESOURCE
  repeated ResourceHandleProto resource_handle_val = 14;

  // DT_VARIANT
  repeated VariantTensorDataProto variant_val = 15;

  // DT_UINT32
  repeated uint32 uint32_val = 16 [packed = true];

  // DT_UINT64
  repeated uint64 uint64_val = 17 [packed = true];
};
```

You can compile tensor.proto with the protoc compiler, which generates two files, tensor.pb.h and tensor.pb.cc. They contain, respectively, the declaration of the TensorProto class and the implementation of its member methods. TensorProto can roughly be regarded as the serialized form of a Tensor. The conversion code between the two is as follows (seg5):

```cpp
// Tensor serialization
// `tensor` is assumed to be a pointer to an existing Tensor.
auto tensor_proto = new TensorProto();
// Fills in `proto` with `*this` tensor's content.
// `AsProtoField()` fills in the repeated field for `proto.dtype()`,
// while `AsProtoTensorContent()` encodes the content in
// `proto.tensor_content()` in a compact form.
tensor->AsProtoField(tensor_proto);
tensor->AsProtoTensorContent(tensor_proto);

// Tensor deserialization
Tensor restored;
// FromProto() takes a const TensorProto& and its result must be checked.
bool ok = restored.FromProto(*tensor_proto);
delete tensor_proto;
```
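The round trip between a tensor and a compact serialized form can be sketched without protobuf. The wire format below is invented for this example and is not the TensorProto encoding: a one-byte dtype code, a one-byte rank, the dimensions as int64 values, and then the raw float32 content (loosely analogous to tensor_content). `serialize` and `deserialize` are hypothetical helpers.

```python
import struct

# Toy wire format, NOT the protobuf encoding:
# [dtype code: 1 byte][rank: 1 byte][each dim: int64][raw float32 content]
DT_FLOAT = 1


def serialize(shape, values):
    """Encode shape and values in a compact, tensor_content-like form."""
    header = struct.pack("<BB", DT_FLOAT, len(shape))
    dims = struct.pack("<%dq" % len(shape), *shape)
    content = struct.pack("<%df" % len(values), *values)
    return header + dims + content


def deserialize(blob):
    """Inverse of serialize(); returns (shape, values)."""
    dtype, rank = struct.unpack_from("<BB", blob, 0)
    if dtype != DT_FLOAT:
        raise ValueError("unsupported dtype code %d" % dtype)
    shape = struct.unpack_from("<%dq" % rank, blob, 2)
    count = 1
    for d in shape:
        count *= d
    values = struct.unpack_from("<%df" % count, blob, 2 + 8 * rank)
    return list(shape), list(values)


blob = serialize([2, 2], [1.0, 2.0, 3.0, 4.0])
print(len(blob))          # 34 bytes: 2 header + 16 dims + 16 content
print(deserialize(blob))  # ([2, 2], [1.0, 2.0, 3.0, 4.0])
```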

## 3. Cross-framework programming - general memory tensor DLPack

### 3.1 What is DLPack

DLPack is an open in-memory tensor structure for sharing tensors between AI frameworks. Combining several frameworks to solve an AI problem lets each framework play to its strengths (some operations are better supported in one framework than another) and can give the best overall performance. This raises a key problem: how can a tensor in memory be passed from one framework to another without copying any data? Fortunately, Chen Tianqi's team provided the answer: DLPack.

DLPack is designed to be as lightweight as possible. It does not deal with memory allocation or device APIs and focuses only on the tensor data structure. It works across multiple hardware platforms, and the frameworks that currently support it include NumPy, CuPy, PyTorch, TensorFlow, MXNet, TVM, and mpi4py. The DLPack developers do not intend to implement Tensors and ops; instead, DLPack serves as a common bridge for reusing tensors and operations across frameworks. To understand DLPack in depth, you need to master two parts: the C API and the Python API. The DLPack C API architecture is as follows:

The dark blue structures in the above figure are all defined in [13]. DLTensor represents a normal C Tensor object, but is not responsible for memory management. DLManagedTensor is also a C Tensor object, which is responsible for the memory management of DLTensor, and it is designed to help other frameworks borrow this DLTensor. Next, we turn our attention to DLPack's Python API.

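DLPack's Python interface is the standard data-interchange API for Python array libraries. According to the Python specification [12], it consists of two parts: a from_dlpack(x) function, and the __dlpack__(self, stream=None) and __dlpack_device__() methods on the array object.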

Consider y = from_dlpack(x) at the semantic level: the library that produces x is called the producer, and the library that provides from_dlpack() is called the consumer. The producer provides a way to access the data of x. In general the data is shared between producer and consumer with zero copies, so y can be regarded as a view of x. Looking inside from_dlpack(x), the x.__dlpack__ method produces a PyCapsule object (a capsule) that contains a DLManagedTensor and can be consumed only once. The producer must set the name of the capsule to "dltensor" so that it can be checked by name, and must give the capsule a PyCapsule_Destructor that calls the deleter of the DLManagedTensor when a capsule still named "dltensor" is no longer needed. The consumer takes ownership of the DLManagedTensor by renaming the capsule to "used_dltensor", which ensures that the PyCapsule_Destructor does not free the tensor. After ownership has been transferred, the consumer object's own destructor calls the deleter of the DLManagedTensor once the tensor is no longer needed.
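The ownership hand-off described above can be simulated in pure Python. This is a toy model for intuition, not the real PyCapsule machinery: `ToyCapsule`, `ToyConsumerArray`, and `from_capsule` are hypothetical names, and explicit `destroy()` and `close()` calls stand in for the destructors that Python and the consumer library would run automatically.

```python
# Toy model of the capsule hand-off; class and function names are hypothetical.
class ToyCapsule:
    def __init__(self, managed_tensor, deleter):
        self.name = "dltensor"         # set by the producer
        self.managed_tensor = managed_tensor
        self.deleter = deleter

    def destroy(self):
        """Capsule destructor: frees the tensor only if nobody consumed it."""
        if self.name == "dltensor":
            self.deleter(self.managed_tensor)


class ToyConsumerArray:
    def __init__(self, managed_tensor, deleter):
        self.data = managed_tensor
        self._deleter = deleter

    def close(self):
        """The consumer is now responsible for calling the deleter."""
        self._deleter(self.data)


def from_capsule(capsule):
    """Consume a capsule exactly once and take ownership of its tensor."""
    if capsule.name != "dltensor":
        raise ValueError("capsule has already been consumed")
    capsule.name = "used_dltensor"     # ownership moves to the consumer
    return ToyConsumerArray(capsule.managed_tensor, capsule.deleter)


freed = []
capsule = ToyCapsule(managed_tensor=[1, 2, 3], deleter=freed.append)
array = from_capsule(capsule)
capsule.destroy()   # no-op: the capsule was renamed
array.close()       # the consumer releases the tensor
print(freed)        # [[1, 2, 3]]: the deleter ran exactly once
```

The rename is what prevents a double free: only one of the two parties ever calls the deleter.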

### 3.2 dlpack in TensorFlow

The author found that TensorFlow's support for DLPack started from v2.2.0, and earlier versions did not have a corresponding library for dlpack. TensorFlow's dlpack interface follows the same semantic description as 3.1, and the corresponding API test statements are as follows:

```python
import tensorflow as tf

x = tf.constant(5)
x  # <tf.Tensor: shape=(), dtype=int32, numpy=5>
r = tf.experimental.dlpack.to_dlpack(x)
print(r, type(r))  # <capsule object "dltensor" at 0x7f55a0431c30> <class 'PyCapsule'>
x_other = tf.experimental.dlpack.from_dlpack(r)
x_other  # <tf.Tensor: shape=(), dtype=int32, numpy=5>
```

### 3.3 Relationship between TVM and DLPack

If you want to build a deep learning compiler that works across AI frameworks, DLPack is a feasible approach, and it is the route TVM takes. For example, we can declare and compile a matrix multiplication operator in TVM and then build a DLPack-based wrapper that lets this operator accept PyTorch Tensors. The same can be done for MXNet. The way DLPack provides an intermediate wrapper shared between the AI frameworks and TVM is shown in the following figure:

The above principle can refer to the following code example:

```python
import numpy as np
import tvm

# Prerequisite: compute the matrix multiplication in PyTorch
import torch
x = torch.rand(56, 56)
y = torch.rand(56, 56)
z = x.mm(y)

# Step 1: define and build a TVM matrix multiplication operator
n = tvm.convert(56)
X = tvm.placeholder((n, n), name='X')
Y = tvm.placeholder((n, n), name='Y')
k = tvm.reduce_axis((0, n), name='k')
Z = tvm.compute((n, n), lambda i, j: tvm.sum(X[i, k] * Y[k, j], axis=k))
s = tvm.create_schedule(Z.op)
fmm = tvm.build(s, [X, Y, Z], target_host='llvm', name='fmm')

# Step 2: wrap the TVM function so that it supports PyTorch Tensors, and verify the result
from tvm.contrib.dlpack import to_pytorch_func
# fmm is the previously built TVM function (Python function)
# fmm_pytorch is the wrapped TVM function (Python function)
fmm_pytorch = to_pytorch_func(fmm)
z2 = torch.empty(56, 56)
fmm_pytorch(x, y, z2)
np.testing.assert_allclose(z.numpy(), z2.numpy())

# Step 3: wrap the function for MXNet in the same way as in step 2
import mxnet
from tvm.contrib.mxnet import to_mxnet_func
ctx = mxnet.cpu(0)
x = mxnet.nd.uniform(shape=(56, 56), ctx=ctx)
y = mxnet.nd.uniform(shape=(56, 56), ctx=ctx)
z = mxnet.nd.empty(shape=(56, 56), ctx=ctx)
f = tvm.build(s, [X, Y, Z], target_host='llvm', name='f')
f_mxnet = to_mxnet_func(f)
f_mxnet(x, y, z)
np.testing.assert_allclose(z.asnumpy(), x.asnumpy().dot(y.asnumpy()))

# Step 4: the detailed definition of to_pytorch_func()
# TVM provides functions that convert between DLPack tensors and TVM NDArrays.
# At the lowest level, TVM functions operate on TVM NDArrays.
# The approximate flow of this wrapper is:
# AI framework Tensor -> DLPack tensor -> TVM NDArray -> call TVM function
# (`ndarray` below refers to TVM's ndarray module.)
def convert_func(tvm_func, tensor_type, to_dlpack_func):
    assert callable(tvm_func)

    def _wrapper(*args):
        args = tuple(ndarray.from_dlpack(to_dlpack_func(arg))
                     if isinstance(arg, tensor_type) else arg for arg in args)
        return tvm_func(*args)

    return _wrapper

def to_pytorch_func(tvm_func):
    import torch
    import torch.utils.dlpack
    return convert_func(tvm_func, torch.Tensor, torch.utils.dlpack.to_dlpack)
```
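The wrapper pattern used by convert_func can be tried out without TVM or PyTorch installed. The sketch below keeps the same structure but replaces every external piece with a stand-in: `FrameworkTensor`, `to_capsule`, `from_capsule`, and `add_into` are hypothetical names invented for this illustration, and a plain Python list plays the role of the shared memory.

```python
# Runnable toy of the wrapper pattern; every name below is a stand-in.
class FrameworkTensor:
    """Plays the role of a framework tensor such as torch.Tensor."""

    def __init__(self, values):
        self.values = values


def to_capsule(tensor):
    """Plays the role of to_dlpack: expose the data without copying it."""
    return ("dltensor", tensor.values)


def from_capsule(capsule):
    """Plays the role of ndarray.from_dlpack: unwrap the shared data."""
    return capsule[1]


def convert_func(func, tensor_type, to_capsule_func):
    """Wrap `func` so that framework tensors are converted on the way in."""
    def _wrapper(*args):
        converted = tuple(
            from_capsule(to_capsule_func(arg)) if isinstance(arg, tensor_type) else arg
            for arg in args
        )
        return func(*converted)
    return _wrapper


def add_into(x, y, out):
    """Stand-in for a compiled function that writes its result into `out`."""
    for i in range(len(out)):
        out[i] = x[i] + y[i]


wrapped = convert_func(add_into, FrameworkTensor, to_capsule)
out = FrameworkTensor([0, 0])
wrapped(FrameworkTensor([1, 2]), [10, 20], out)
print(out.values)  # [11, 22]: the result landed in the caller's tensor
```

Arguments that are not framework tensors (the plain list here) pass through unchanged, and because nothing is copied, the compiled function writes directly into the memory owned by the caller's tensor.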

## 4. Summary

This article covers a lot of ground and is demanding, so readers are encouraged to read it more than once. To summarize, the article discussed three topics: tensors from the user's point of view, tensors inside the TensorFlow system (the front-end to back-end mapping, the C++ Tensor class, and serialization), and cross-framework programming with DLPack.

## References

1. TensorFlow Introduction: https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/low_level_intro.md
2. TensorFlow Tensors: https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/tensors.md
3. tf.constant source code: https://github.com/tensorflow/tensorflow/blob/v1.15.5/tensorflow/python/framework/constant_op.py#L165
4. framework-tensor of tensorflow source code analysis: https://www.cnblogs.com/jicanghai/p/9537282.html
5. TensorFlow C++ Tensor source code (tensor.h): https://github.com/tensorflow/tensorflow/blob/v1.15.5/tensorflow/core/framework/tensor.h
6. TensorFlow C++ Tensor source code (tensor.cc): https://github.com/tensorflow/tensorflow/blob/v1.15.5/tensorflow/core/framework/tensor.cc
7. TensorFlow: A System for Large-Scale Machine Learning: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
8. tensorflow-internals.pdf: https://github.com/horance-liu/tensorflow-internals
9. DLPack doc: https://dmlc.github.io/dlpack/latest/
10. DLPack github: https://github.com/dmlc/dlpack
11. DLPack C API: https://dmlc.github.io/dlpack/latest/c_api.html
12. Python Specification for DLPack: https://dmlc.github.io/dlpack/latest/python_spec.html
13. dlpack.h: https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h
14. Building a Cross-Framework Deep Learning Compiler via DLPack: https://tvm.apache.org/2018/08/10/DLPack-Bridge