WO2022241725A1 - Convolution operation processing method, and electronic device, mobile terminal and storage medium - Google Patents

Convolution operation processing method, and electronic device, mobile terminal and storage medium

Info

Publication number
WO2022241725A1
Authority
WO
WIPO (PCT)
Prior art keywords
convolution
matrix
configuration
parameter
matrix multiplication
Prior art date
Application number
PCT/CN2021/094948
Other languages
French (fr)
Chinese (zh)
Inventor
庄晨
孟金涛
魏彦杰
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Priority to PCT/CN2021/094948
Publication of WO2022241725A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology

Definitions

  • the present application relates to the field of reconfigurable technology, and in particular to a convolution operation processing method, an electronic device, a mobile terminal and a storage medium.
  • deep learning (DL)
  • Specific applications include real-time gaming robots, self-driving car navigation, VR social platforms, and traffic monitoring using millions of cameras.
  • models trained on GPU clusters and TPU clusters are deployed on edge devices to provide real-time artificial intelligence services.
  • Convolution calculation is the main computational part of the convolutional neural networks (CNNs) commonly used in artificial intelligence services, accounting for more than 99% of the operations in many network models. Convolution calculations can be converted into matrix multiplication, so many applications use BLAS (Basic Linear Algebra Subprograms), hand-written matrix operation routines, or even extended matrix operation routines to implement convolution calculations.
  • convolutional neural network (CNN)
  • the first aspect of the embodiments of the present application provides a convolution operation processing method, including: obtaining a convolution operation to be processed and a configuration database; converting the convolution operation into a matrix multiplication, the matrix multiplication corresponding to a convolution size; if it is determined that the configuration database contains no configuration parameters corresponding to the convolution size, defining a parameter search space according to the convolution size and hardware parameters; generating multiple operation codes according to the configuration parameters in the parameter search space, and computing the matrix multiplication with the multiple operation codes to obtain multiple operation results; and storing in the configuration database the configuration parameters of the operation code whose operation result satisfies a preset condition.
  • the second aspect of the embodiments of the present application provides a mobile terminal, including a processor and a memory, where a computer program is stored in the memory and the processor is used to execute the computer program to implement the processing method provided in the first aspect of the embodiments of the present application.
  • a third aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the processing method provided in the first aspect of the embodiments of the present application.
  • different from the prior art, the present application addresses current convolution processing by determining that the configuration database contains no corresponding configuration parameters and defining a parameter search space according to the convolution size and hardware parameters; the matrix multiplication is then reconstructed and optimized according to the configuration parameters in that search space, multiple operation codes are generated, and the matrix multiplication is computed with those codes to obtain multiple operation results, thereby improving the matrix-multiplication performance of the convolution operation.
  • Fig. 1 is a schematic flow chart of the first embodiment of the convolution processing method of the present application;
  • Fig. 2 is a schematic flow chart of a specific embodiment of step S13 in Fig. 1;
  • Fig. 3 is a schematic flow chart of a specific embodiment of step S23 in Fig. 2;
  • Fig. 4 is a schematic flow chart of a specific embodiment of step S14 in Fig. 1;
  • Fig. 5 is a schematic flow chart of a specific embodiment of step S15 in Fig. 1;
  • Fig. 6 is a schematic diagram of the matrix framework of a specific embodiment of the convolution processing method of the present application;
  • Fig. 7 is a schematic diagram of the block structure of a matrix in the present application;
  • Fig. 8 is a schematic flow chart of a specific embodiment of the convolution processing method of the present application;
  • Fig. 9 is a schematic diagram of the results of the convolution processing method of the present application;
  • Fig. 10 is a schematic block diagram of an embodiment of a mobile terminal of the present application;
  • Fig. 11 is a schematic block diagram of an embodiment of a computer-readable storage medium of the present application.
  • the term "if" may be construed as "when", "once", "in response to determining" or "in response to detecting", depending on the context.
  • similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
  • Convolution calculation is the main computational part of commonly used convolutional neural networks (CNNs), accounting for more than 99% of the operations in many network models, as shown in Table 1:
  • CPUs based on the ARM architecture are the dominant hardware in mobile devices, making them a suitable and practical platform for exploring the best solutions for current neural network deployment.
  • when deep learning applications serve deep learning models using backend compute libraries on ARM SoCs, they must address the issue of "application performance portability".
  • FIG. 1 is a schematic flow chart of the first embodiment of the processing method of the convolution operation of the present application, which specifically includes the following steps:
  • a convolution operation usually consists of three parts: two convolution operands and the convolution calculation method corresponding to them.
  • by obtaining the convolution operation to be processed, preparatory work for the convolution operation can be carried out.
  • a configuration database is often stored locally and contains corresponding convolution calculation methods; by obtaining this database during the convolution operation, further computation can be performed on the converted form of the convolution.
  • the convolution operation to be processed may be obtained first, and the configuration database obtained through the convolution.
  • the convolution operation to be processed and the configuration database can also be obtained at the same time.
  • those skilled in the art can also perform part of the processing on the convolution operation to be processed and then obtain the configuration database, according to specific needs, which is not limited here.
  • after obtaining the convolution operation to be processed, the convolution operation can be converted into a matrix multiplication; because a convolution usually corresponds to a convolution size, the converted matrix multiplication also corresponds to that convolution size.
  • the convolution operation to be processed is obtained first, and the configuration database is obtained through convolution.
  • the convolution calculation method in the configuration database can be obtained through the size of the convolution, so that the convolution operation is converted into matrix multiplication.
  • the Im2col algorithm can be performed on the convolution operation to convert the convolution operation into a matrix multiplication calculation corresponding to the convolution. Because convolution usually corresponds to a convolution size, the converted matrix multiplication also corresponds to a convolution size.
  • in the single-channel case, the Im2col algorithm unrolls the first matrix into columns, proceeding from left to right and from top to bottom, to form a new matrix; if there are multiple channels, one channel can be converted first and the remaining channels converted in the same way.
  • usually there are configuration parameters corresponding to the convolution size in the configuration database, which can be used to directly obtain the empirically optimal parameter combination, generate code to compute the converted matrix multiplication, and obtain the calculation result.
  • a parameter search space is defined according to the convolution size and hardware parameters, used to store configuration parameters and provide the space over which the matrix multiplication is computed.
  • S14: Generate multiple operation codes according to the configuration parameters in the parameter search space, and use the multiple operation codes to compute the matrix multiplication to obtain multiple operation results;
  • the corresponding matrices form various configuration parameters and are stored in the parameter search space.
  • each configuration parameter has a preset value range
  • a specific configuration parameter combination can be determined according to the chosen values, so that an operation code is generated from the configuration parameters in the parameter search space; by computing the matrix multiplication with that operation code, the corresponding operation result can be obtained.
  • multiple operation codes can be generated, and multiple operation codes can be used to calculate matrix multiplication to obtain multiple operation results.
  • these operation results may or may not be the same.
  • preset conditions can be set.
  • the preset condition can be the running time of the optimized matrix multiplication, or the performance error obtained when running the optimized matrix multiplication, and so on.
  • the configuration parameters of the operation code corresponding to the one of the multiple operation results that meets the preset condition can be stored in the configuration database, allowing the configuration database to self-update and self-optimize.
  • this application addresses current convolution processing by determining that the configuration database contains no corresponding configuration parameters and defining a parameter search space according to the convolution size and hardware parameters, so that the matrix multiplication is reconstructed and optimized according to the configuration parameters in the parameter search space, multiple operation codes are generated, and the matrix multiplication is computed with them to obtain multiple operation results, thereby improving the matrix-multiplication performance of convolution operations.
  • FIG. 2 is a schematic flow chart of a specific embodiment of step S13 in FIG. 1, which includes the following steps:
  • if the obtained convolution operation has been processed before, or its convolution size matches a previously processed convolution operation, the configuration database contains configuration parameters corresponding to the convolution size; if the convolution size has changed or does not match, there are no corresponding configuration parameters. This is established by the judgment step.
  • if there are configuration parameters corresponding to the convolution size in the configuration database, the database already holds the optimal configuration parameters and an efficient matrix multiplication can be completed; proceed to step S22: generate the operation code according to the configuration parameters and compute it to obtain the operation result.
  • if there is no configuration parameter corresponding to the convolution size in the configuration database, no optimal configuration parameters are available and the matrix multiplication can only be completed through continued search and optimization; proceed to step S23: define a parameter search space corresponding to the configuration parameters according to the convolution size and the hardware parameters.
  • the configuration parameters corresponding to the convolution size include at least: the number of rows M and the number of columns K of the first matrix A; the number of rows mc and the number of columns kc of the cache block of matrix A; the number of columns N of the second matrix B; the number of columns nc of the cache block of matrix B; the number of rows m_reg and the number of columns n_rreg of the register block; the prefetch value pre_a of matrix A and the prefetch value pre_b of matrix B; and the search-space label loopReorder.
  • the value range of the number of rows mc of the cache block of the first matrix A is [8, max(M, 1024)], where M is the number of rows of matrix A; the value range of the number of columns kc of the cache block of matrix A is [8, max(K, 1024)], where K is the number of columns of matrix A; the value range of the number of columns nc of the cache block of the second matrix B is [8, max(N, 1024)], where N is the number of columns of matrix B; the number of rows m_reg of the register block is 4 or 8; the number of columns n_rreg of the register block is 8, 12 or 16; the prefetch values pre_a of matrix A and pre_b of matrix B take values from at least 0, 32, 64, 128, 256 or 512; and the search-space label loopReorder takes values from at least 0, 1, 2 or 3. See Table 2 for details.
  • FIG. 3 is a schematic flow chart of a specific embodiment of step S23 in FIG. 2, which includes the following steps:
  • differences in hardware parameters cause the sets of parameter combinations corresponding to a given convolution size to differ.
  • each set of parameter combinations can define the corresponding parameter search space
  • the corresponding parameter search space can be defined from a selected set of parameter combinations; from multiple selected sets of parameter combinations, multiple corresponding parameter search spaces can be defined at the same time.
  • FIG. 4 is a schematic flow chart of a specific embodiment of step S14 in FIG. 1, which includes the following steps:
  • each set of parameter combinations can generate an operation code corresponding to convolution
  • multiple sets of parameter combinations can generate multiple operation codes corresponding to convolution, which are used to perform operations on the same matrix multiplication.
  • S42: Compute the matrix multiplication using a plurality of operation codes to obtain an I-th operation result, where I is a positive integer greater than 1 and less than or equal to the number of operation codes.
  • multiple operation codes can be queued and used to compute the matrix multiplication one at a time, yielding the operation results in sequence, or used simultaneously to compute the matrix multiplication and obtain multiple operation results at once.
  • FIG. 5 is a schematic flow chart of a specific embodiment of step S15 in FIG. 1, which includes the following steps:
  • a preset condition is set, and more can be set as needed; here it at least includes that the matrix-multiplication computation time is the shortest among the multiple operation results. That is, each operation result is the running time obtained by executing an operation code for the matrix multiplication, and the preset condition is to find the shortest running time among the multiple running times, thereby selecting the first operation result.
  • if the first operation result meets the preset condition, proceed to step S52: store the first operation result and/or its corresponding configuration parameters in the configuration database; if the first operation result does not meet the preset condition, proceed to step S53: discard the configuration parameters corresponding to the first operation result.
  • with three operation results, selection can also be made by sequential comparison.
  • for example, suppose the first operation result shows a computation time of 10 seconds, the second 3 seconds, and the third 5 seconds; since 10 seconds > 3 seconds and 3 seconds < 5 seconds, the configuration parameters corresponding to the second operation result are optimal.
  • there are many ways to compare; the choice can be made according to specific needs and is not limited here.
  • the performance of the reconfigurable matrix multiplication computing library corresponding to the configuration database can be improved by 2% to 17% on different convolutional layers.
  • the following describes the convolution processing method of this application by taking a specific embodiment of convolution on an ARM-architecture CPU as an example.
  • in this embodiment, the convolution operation is also referred to as convolution calculation; the two mean the same thing.
  • the purpose of this specific embodiment of the present application is to propose an acceleration library for convolution calculations, which can automatically generate optimized convolution calculations to explore the best performance on the latest ARM-based hardware architectures from different suppliers.
  • the convolution calculation library is mainly optimized for matrix multiplication, integrating matrix multiplication into a parameterized reconfigurable library that, for any given convolution shape and hardware target, searches for an optimal combination of runtime parameters, including register kernel shape, cache tile size and scheduling strategy.
  • the method of the present application includes the following design features:
  • a reconfigurable matrix multiplication library: the convolution calculation is converted into matrix multiplication through the Im2col algorithm, so this application designs a reconfigurable library for matrix multiplication with a multi-level code cache hierarchy. It is used to search over and reproduce all possible code structures, including register kernel shapes, cache block sizes, and various combinations of loop-order scheduling, reordering strategies, memory access patterns, and online/offline computation. With this reconfigurable matrix multiplication library, the workload of manual tuning is reduced.
  • a convolution computing library can be constructed by combining it with the Im2col algorithm. It can use an auto-tuning strategy to search all parameter configurations for the best performance given the hardware specification and convolution problem size. Optimal parameter configurations can be stored and reused.
  • This convolution computing library can be embedded into existing deep learning framework software stacks. For various combinations of different hardware specifications and convolution shapes, the convolution calculation library does not require manual optimization.
  • Fig. 6 is a schematic diagram of the matrix framework of a specific embodiment of the convolution processing method of the present application; Fig. 7 is a schematic diagram of the block structure of the matrix in the present application.
  • the convolution computation multiplies each kernel tensor with a block of the input image element-wise, and then accumulates the results across all input channels.
  • Matrix multiplication is well optimized for operations on large square matrices due to its high compute-to-memory ratio.
  • the steps of the Im2col algorithm are shown in Figure 6: it first reshapes the convolution kernel tensor F into a matrix of size K×CRS, and then copies patches of the original input image into a matrix of size CRS×HW. Through these two steps the convolution calculation is converted into matrix multiplication, and an output matrix of size K×EF can then be obtained with a single matrix multiplication. The matrices generated by the Im2col algorithm in deep learning therefore have various shapes, such as long, thin matrices, and a fixed microkernel, data layout and data-processing schedule may not provide optimal solutions for matrix operations of different shapes.
  • F_m can be the rearrangement of the convolution kernel; on the right side, D is computed in blocks matching the shape of F, and D is then arranged vertically from bottom to top, red with red, green with green and blue with blue, to obtain O_m.
  • a reconfigurable matrix multiplication framework needs to be built.
  • for matrix multiplication, there are various high-performance implementations.
  • one of these methods, named "GEPB", performs significantly better than the others, so the "GEPB" approach is chosen to implement the reconfigurable matrix multiplication framework.
  • the step diagram of the "GEPB" method is shown in the figure, and the specific steps are as follows:
  • the matrix A is divided into several column matrix blocks according to the columns, and the matrix B is divided into several row matrix blocks according to the rows.
  • the original "GEPB" method was not optimized for different hardware platforms, so optimization strategies that adapt to different hardware platforms need to be added to it, forming a reconfigurable matrix multiplication framework; a sketch of the resulting blocked loop structure is given after this list.
  • the TLB and cache details of the CPU affect the sizes of the cache blocks, register blocks and data prefetching in the matrix multiplication, which in turn strongly affect its performance, especially since, after the Im2col algorithm, the matrices in deep learning are usually long strip-shaped matrices.
  • on different hardware platforms, the specific TLB and cache details differ.
  • a loop reordering method is also used to control the memory access pattern of the matrix multiplication subroutine.
  • loop reordering over the cache tiles determines whether the tiles of matrix A are scanned once and kept in the L1 cache while the tiles of matrix B are scanned multiple times and kept in the L2 cache, or, conversely, the tiles of matrix B are scanned once and kept in the L1 cache while the tiles of matrix A are scanned multiple times and kept in the L2 cache.
  • on this basis, a reconfigurable matrix multiplication library can be built, and the automatic optimization process is carried out on this matrix-multiplication framework.
  • Fig. 8 is a schematic flow diagram of a specific embodiment of the processing method of the convolution operation of the present application; the entire convolution calculation and automatic optimization process is shown in Fig. 8, and the steps of automatic optimization are as follows:
  • the local configuration database will be queried to see if there are optimal configuration parameters under the current hardware configuration and convolution size.
  • the optimal configuration parameters can be those that yielded the best result the last time a matrix multiplication with the same hardware configuration and convolution size was performed. If an optimal parameter configuration exists, go to step S804: generate code from the optimal parameter combination, and then to step S805: execute the operation code to perform the matrix multiplication and obtain the operation result.
  • if not, a parameter-configuration search space is defined from the specific hardware parameters and the convolution size, and operation codes are generated from the parameter combinations to execute the algorithm.
  • step S810: update the configuration database, searching the entire parameter-configuration space; whenever a configuration outperforms the current best, the optimal parameter configuration is updated, until the entire parameter-configuration space has been searched. Then step S811 checks whether the traversal search is complete; if so, step S812 follows: tuning is finished and the optimal code is saved.
  • this application builds a reconfigurable matrix multiplication framework to adapt to ARM-architecture CPU platforms with different characteristics, and extracts several key performance-related runtime parameters from ordinary matrix multiplication, so this matrix multiplication library can generate better-performing matrix multiplication code for different hardware architectures and convolution shapes, efficiently completing convolution computations.
  • the deep learning network model selected for the test is VGG-16. Because the performance of the Im2col part of the convolution calculation is the same as in other existing technologies, we directly single out the matrix-multiplication part of the convolution calculation for performance testing and comparison.
  • Our matrix multiplication implementation FastConv-GEMM is compared with AutoTVM, AutoTVM+LIBXSMM, AutoTVM+OpenBLAS and OpenBLAS.
  • "FastConv without autotuning" represents the results of testing the reconfigurable matrix multiplication library directly with the default parameter configuration, without automatic optimization;
  • "FastConv with autotuning" represents first running automatic optimization in the reconfigurable matrix multiplication library to find the optimal parameter configuration for the current hardware architecture and convolution size, and then testing with that optimal parameter configuration.
  • Figure 9 is a schematic diagram of the results of the convolution operation processing method of this application.
  • Figure 9 shows speedup ratios relative to the AutoTVM implementation on the Kunpeng 920, taken as the test baseline; the AutoTVM code is generated using its own Halide on the Kunpeng 920 core and serves as the benchmark for all six competitors.
  • the OpenBLAS computing library itself has better performance, which indicates that the AutoTVM scheduling method using machine learning cannot improve the efficiency of previous high-performance matrix computing libraries on Kunpeng 920.
  • the performance of our reconfigurable matrix multiplication library with the default parameter configuration ranks second among all methods, behind only the auto-tuned reconfigurable matrix multiplication library.
  • FIG. 10 is a schematic block diagram of an embodiment of a mobile terminal of the present application.
  • the embodiment of the present application provides a mobile terminal 2, including a processor 21 and a memory 22; the memory 22 stores a computer program 221, and the processor 21 is used to execute the computer program 221 to implement the processing method of the first aspect of the embodiments of the present application, which will not be repeated here.
  • FIG. 11 is a schematic block diagram of an embodiment of a computer-readable storage medium of the present application. If implemented in the form of a software functional unit and sold or used as an independent product, the method can be stored in the computer-readable storage medium 30. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage device and includes several instructions (the computer program 31) for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage devices include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
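As referenced above, the following is a minimal sketch of a GEPB-style blocked matrix multiplication, showing how the cache-block sizes (mc, kc, nc), register-block sizes (m_reg, n_rreg) and the loopReorder tag from Table 2 shape the loop structure. It is a didactic Python model under assumptions made here, not the patent's ARM implementation; real code would pack tiles and use NEON micro-kernels with the pre_a/pre_b prefetch distances.

```python
import numpy as np

def gepb_matmul(A, B, mc=64, kc=64, nc=64, m_reg=4, n_rreg=8, loop_reorder=0):
    """Blocked GEMM sketch: C = A @ B with cache tiles and a register-blocked
    inner kernel. loop_reorder swaps the two tile loops, deciding whether the
    tiles of A or of B stay resident across the inner loop (the L1/L2 reuse
    trade-off described in the text)."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for k0 in range(0, K, kc):                        # shared K-dimension tiles
        k1 = min(k0 + kc, K)
        tiles = [(i0, j0) for i0 in range(0, M, mc) for j0 in range(0, N, nc)]
        if loop_reorder:                              # swap tile traversal order
            tiles = [(i0, j0) for j0 in range(0, N, nc) for i0 in range(0, M, mc)]
        for i0, j0 in tiles:
            i1, j1 = min(i0 + mc, M), min(j0 + nc, N)
            for i in range(i0, i1, m_reg):            # register-blocked kernel
                for j in range(j0, j1, n_rreg):
                    ii, jj = min(i + m_reg, i1), min(j + n_rreg, j1)
                    C[i:ii, j:jj] += A[i:ii, k0:k1] @ B[k0:k1, j:jj]
    return C
```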

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Stored Programmes (AREA)

Abstract

A convolution operation processing method, and an electronic device, a mobile terminal and a storage medium. The processing method comprises: acquiring a convolution operation to be processed and a configuration database (S11); converting the convolution operation into a matrix multiplication, wherein the matrix multiplication corresponds to a convolution size (S12); if it is determined that the configuration database contains no configuration parameter corresponding to the convolution size, defining a parameter search space according to the convolution size and a hardware parameter (S13); generating a plurality of operation codes according to configuration parameters in the parameter search space, and calculating the matrix multiplication by using the plurality of operation codes, so as to obtain a plurality of operation results (S14); and storing, in the configuration database, the configuration parameters of the operation code corresponding to the operation result, among the plurality of operation results, that meets a preset condition (S15). In this way, reconstruction and optimization can be performed on the matrix multiplication, thereby improving the convolution operation by using a matrix multiplication with better performance.

Description

Processing method of convolution operation, electronic device, mobile terminal and storage medium

Technical field
The present application relates to the field of reconfigurable technology, and in particular to a convolution operation processing method, an electronic device, a mobile terminal and a storage medium.
Background
In recent years, a large number of deep learning (DL) applications have gradually spread from professional scientific fields to the consumer market. Specific applications include real-time gaming robots, self-driving car navigation, VR social platforms, and traffic monitoring using millions of cameras. In many cases, models trained on GPU clusters and TPU clusters are deployed on edge devices to provide real-time artificial intelligence services.
Convolution calculation is the main computational part of the convolutional neural networks (CNNs) commonly used in artificial intelligence services, accounting for more than 99% of the operations in many network models. Convolution calculations can be converted into matrix multiplication, so many applications use BLAS (Basic Linear Algebra Subprograms), hand-written matrix operation routines, or even extended matrix operation routines to implement convolution calculations.
At present, most of the matrices generated in convolutional neural networks are long strip-shaped matrices, while the high-performance BLAS computing libraries are basically optimized for square-matrix operations. Because the optimization strategies do not match, these libraries usually cannot provide the best performance on such long strip-shaped matrices, and the performance of the matrix multiplication cannot be improved accordingly.
Summary of the invention
A first aspect of the embodiments of the present application provides a convolution operation processing method, including: obtaining a convolution operation to be processed and a configuration database; converting the convolution operation into a matrix multiplication, the matrix multiplication corresponding to a convolution size; if it is determined that the configuration database contains no configuration parameters corresponding to the convolution size, defining a parameter search space according to the convolution size and hardware parameters; generating multiple operation codes according to the configuration parameters in the parameter search space, and computing the matrix multiplication with the multiple operation codes to obtain multiple operation results; and storing in the configuration database the configuration parameters of the operation code whose operation result satisfies a preset condition.
A second aspect of the embodiments of the present application provides a mobile terminal, including a processor and a memory, where a computer program is stored in the memory and the processor is used to execute the computer program to implement the processing method provided in the first aspect of the embodiments of the present application.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the processing method provided in the first aspect of the embodiments of the present application.
The beneficial effects of the present application are as follows. Different from the prior art, the present application addresses current convolution processing by determining that the configuration database contains no corresponding configuration parameters and defining a parameter search space according to the convolution size and hardware parameters; the matrix multiplication is then reconstructed and optimized according to the configuration parameters in that search space, multiple operation codes are generated, and the matrix multiplication is computed with those codes to obtain multiple operation results, thereby improving the matrix-multiplication performance of the convolution operation.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic flow chart of the first embodiment of the convolution processing method of the present application;
Fig. 2 is a schematic flow chart of a specific embodiment of step S13 in Fig. 1;
Fig. 3 is a schematic flow chart of a specific embodiment of step S23 in Fig. 2;
Fig. 4 is a schematic flow chart of a specific embodiment of step S14 in Fig. 1;
Fig. 5 is a schematic flow chart of a specific embodiment of step S15 in Fig. 1;
Fig. 6 is a schematic diagram of the matrix framework of a specific embodiment of the convolution processing method of the present application;
Fig. 7 is a schematic diagram of the block structure of a matrix in the present application;
Fig. 8 is a schematic flow chart of a specific embodiment of the convolution processing method of the present application;
Fig. 9 is a schematic diagram of the results of the convolution processing method of the present application;
Fig. 10 is a schematic block diagram of an embodiment of a mobile terminal of the present application;
Fig. 11 is a schematic block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed description
In the following description, specific details such as particular system structures and technologies are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the present application.
It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in the specification of this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise.
It should be further understood that the term "and/or" used in the description of the present application and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the appended claims, the term "if" may be construed as "when", "once", "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
In 2017, the mobile phone consumer market sold 1.5 billion phones. Tencent Beacon reported that the number of online active users/mobile devices in the second quarter of 2019 was 682,956,170. Assuming an average computing performance of 50 GFlops per mobile device, the total theoretical peak performance of the active mobile devices mentioned in the Tencent report would exceed that of Fugaku, the world's fastest supercomputer based on the ARM architecture.
Convolution calculation is the main computational part of commonly used convolutional neural networks (CNNs), accounting for more than 99% of the operations in many network models, as shown in Table 1 below:
Table 1: Proportion of computation time spent on convolution in common deep-learning CNN network models (presented as an image, Figure PCTCN2021094948-appb-000001, in the original document).
In the context of mobile computing, CPUs based on the ARM architecture are the main hardware architecture used in mobile devices, making them a suitable and practical hardware platform for exploring the best solutions for current neural network deployment. Every year, dozens of licensed vendors manufacture dozens of different types of ARM SoCs by modifying the ARM architecture's cache size, memory type, instruction CPI or instruction set. Therefore, if deep learning applications are to fully utilize the hardware resources of a given device, application performance portability is a challenge. When deep learning applications serve deep learning models using backend compute libraries on ARM SoCs, they must address the issue of "application performance portability".
For the billions of ARM SoCs with hundreds of hardware specifications, the productivity of performance porting is another challenge in deploying deep learning models. ARM has released 10 Cortex-M and 16 Cortex-A/X series architectures, while Apple and other vendors have released 37 ARM-based architectures. Therefore, covering all ARM hardware architectures in a matrix operation library through manual tuning is uneconomical. For example, after porting this efficient matrix operation library to 13 different ARM architectures, the authors of OpenBLAS stopped porting work for the Cortex-A73, released in 2016, and later devices.
Therefore, in this context, the present application proposes a convolution operation processing method; an efficient and automated way of porting performance across ARM devices is crucial for designing a new matrix operation library. Please refer to FIG. 1, which is a schematic flow chart of the first embodiment of the convolution processing method of the present application, including the following steps:
S11: Obtain the convolution operation to be processed and the configuration database;
Generally speaking, a convolution operation usually consists of three parts: two convolution operands and the convolution calculation method corresponding to them. By obtaining the convolution operation to be processed, preparatory work for the convolution operation can be carried out.
Usually, a configuration database is stored locally, containing corresponding convolution calculation methods; by obtaining this database during the convolution operation, further computation can be performed on the converted form of the convolution.
The convolution operation to be processed may be obtained first, and the configuration database obtained through the convolution. Alternatively, the convolution operation to be processed and the configuration database may be obtained at the same time; those skilled in the art may also first perform part of the processing on the convolution operation to be processed and then obtain the configuration database, according to specific needs, which is not limited here.
S12: Convert the convolution operation into a matrix multiplication, the matrix multiplication corresponding to a convolution size;
After obtaining the convolution operation to be processed, the convolution operation can be converted into a matrix multiplication; because a convolution usually corresponds to a convolution size, the converted matrix multiplication also corresponds to a convolution size.
In addition, when the convolution operation to be processed is obtained first and the configuration database is obtained through the convolution, the convolution calculation method in the configuration database can specifically be retrieved by the size of the convolution, so that the convolution operation is converted into matrix multiplication.
Specifically, the Im2col algorithm can be applied to the convolution operation to convert it into the corresponding matrix multiplication; because a convolution usually corresponds to a convolution size, the converted matrix multiplication also corresponds to that convolution size.
Usually, in the single-channel case, the Im2col algorithm unrolls the first matrix into columns, proceeding from left to right and from top to bottom, to form a new matrix; if there are multiple channels, one channel can be converted first and the remaining channels converted in the same way.
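To make this conversion concrete, below is a minimal im2col sketch in Python/NumPy. It is an illustration rather than the patent's implementation; the function name, the stride-1/no-padding setting, and the toy sizes are assumptions made here.

```python
import numpy as np

def im2col(x, r, s):
    """Unroll each r x s patch of a single-channel image x (H x W) into a
    column, moving left to right and top to bottom (stride 1, no padding)."""
    h, w = x.shape
    e, f = h - r + 1, w - s + 1            # output height and width
    cols = np.empty((r * s, e * f), dtype=x.dtype)
    for i in range(e):
        for j in range(f):
            cols[:, i * f + j] = x[i:i + r, j:j + s].ravel()
    return cols

# Convolution as one matrix multiplication: reshape K kernels (each r x s)
# into a K x (r*s) matrix, then a single GEMM yields the K x (e*f) output.
# With C input channels, stacking the per-channel conversions gives the
# K x CRS and CRS x HW shapes described in the text.
x = np.arange(16.0).reshape(4, 4)          # toy 4 x 4 single-channel input
kernels = np.random.rand(2, 3, 3)          # K = 2 kernels of size 3 x 3
out = kernels.reshape(2, -1) @ im2col(x, 3, 3)
```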
S13: If it is determined that the configuration database contains no configuration parameters corresponding to the convolution size, define a parameter search space according to the convolution size and the hardware parameters;
Usually the configuration database contains configuration parameters corresponding to the convolution size, which can be used to directly obtain the empirically optimal parameter combination, generate code to compute the converted matrix multiplication, and obtain the calculation result.
As the variety of deep learning applications in data centers and mobile devices has increased, the shapes of the matrices in matrix operations have changed dramatically. In addition, various newly developed SoCs have been released to the market. The growing number of SoCs with different architectures and the wide variety of deep learning applications mean that the configuration database may hold no configuration parameters for a given convolution size, and also make it harder for software developers to support and optimize existing matrix operation libraries.
Under a large number of different hardware configurations and matrix shapes, if it is determined that the configuration database contains no configuration parameters corresponding to the convolution size, a parameter search space is defined according to the convolution size and the hardware parameters, used to store configuration parameters and provide the space over which the matrix multiplication is computed.
S14: Generate multiple operation codes according to the configuration parameters in the parameter search space, and compute the matrix multiplication with the multiple operation codes to obtain multiple operation results;
Since multiple convolutions may correspond to multiple matrices, each convolution having its own matching properties, the corresponding matrices give rise to various configuration parameters, which are stored in the parameter search space.
Because each configuration parameter has a preset value range, a specific configuration parameter combination can be determined once the values are fixed; an operation code is then generated from the configuration parameters in the parameter search space, and by computing the matrix multiplication with that operation code, the corresponding operation result is obtained.
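One hedged way to realize "generating an operation code from a configuration parameter combination" is plain template-based source generation, sketched below. The template text and names are invented for illustration; the patent does not specify its generator.

```python
KERNEL_TEMPLATE = """\
// auto-generated kernel: {m_reg} x {n_rreg} register block,
// cache tiles mc={mc}, kc={kc}, nc={nc},
// prefetch pre_a={pre_a}, pre_b={pre_b}, loop order tag {loopReorder}
void gemm_kernel(const float *A, const float *B, float *C,
                 int M, int N, int K);
"""

def generate_operation_code(params):
    """Instantiate the source template for one configuration parameter
    combination; each combination yields one operation code."""
    return KERNEL_TEMPLATE.format(**params)
```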
When there are multiple combinations of configuration parameters, multiple operation codes can be generated and used to compute the matrix multiplication, yielding multiple operation results; these results may or may not be the same.
S15: Store in the configuration database the configuration parameters of the operation code corresponding to the operation result, among the multiple operation results, that satisfies a preset condition.
In order to select the best configuration parameters, conditions must be imposed on the multiple operation results so as to pick out the results that meet a preset condition. Specifically, a preset condition can be set; it can be the running time of the optimized matrix multiplication, or the performance error obtained when running the optimized matrix multiplication, and so on.
When one of the multiple operation results satisfies the preset condition, the configuration parameters of the operation code corresponding to that result can be stored in the configuration database, allowing the configuration database to self-update and self-optimize.
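Steps S13 to S15 can be read as the tuning loop sketched below. The exhaustive enumeration, the shortest-run-time preset condition, and the dictionary-backed configuration database are assumptions chosen here to illustrate the flow; they are not the patent's concrete interfaces.

```python
import itertools
import time

config_db = {}   # illustrative configuration database: conv size -> best params

def autotune(conv_size, search_space, run_matmul):
    """Enumerate every parameter combination, time the generated code on the
    matrix multiplication, and store the fastest combination (the preset
    condition assumed here) in the configuration database."""
    if conv_size in config_db:                   # parameters already known
        return config_db[conv_size]
    best_params, best_time = None, float("inf")
    names = list(search_space)
    for values in itertools.product(*search_space.values()):
        params = dict(zip(names, values))        # one candidate configuration
        start = time.perf_counter()
        run_matmul(params)                       # execute the generated code
        elapsed = time.perf_counter() - start
        if elapsed < best_time:                  # keep the current optimum
            best_params, best_time = params, elapsed
    config_db[conv_size] = best_params           # S15: store the winner
    return best_params
```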
Therefore, aiming at current convolution processing, the present application determines that the configuration database contains no corresponding configuration parameters, defines a parameter search space according to the convolution size and hardware parameters, reconstructs and optimizes the matrix multiplication according to the configuration parameters in that search space, generates multiple operation codes, and computes the matrix multiplication with those codes to obtain multiple operation results, thereby improving the matrix-multiplication performance of the convolution operation.
Further, if it is determined that the configuration database contains no configuration parameters corresponding to the convolution size, a parameter search space is defined according to the convolution size and the hardware parameters. Please refer to FIG. 2, which is a schematic flow chart of a specific embodiment of step S13 in FIG. 1, including the following steps:
S21: Determine whether the configuration database contains configuration parameters corresponding to the convolution size;
If the obtained convolution operation has been processed before, or its convolution size matches a previously processed convolution operation, the configuration database contains configuration parameters corresponding to the convolution size; if the convolution size has changed or does not match, there are no corresponding configuration parameters. This is established by the judgment step.
If the configuration database contains configuration parameters corresponding to the convolution size, the database holds the optimal configuration parameters and an efficient matrix multiplication can be completed; proceed to step S22: generate the operation code according to the configuration parameters and compute it to obtain the operation result.
若配置数据库中无卷积尺寸对应的配置参数,表示配置数据库中没有最优的配置参数,可以完成矩阵乘法运算需要不断的寻找和优化才能完成,则进入步骤S23:根据卷积尺寸以及硬件参数定义配置参数对应的一参数搜索空间。If there is no configuration parameter corresponding to the convolution size in the configuration database, it means that there is no optimal configuration parameter in the configuration database, and the matrix multiplication operation needs to be continuously searched and optimized to complete, then enter step S23: according to the convolution size and hardware parameters Defines a parameter search space corresponding to configuration parameters.
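The S21–S23 branch can be sketched as follows, reusing the key-value model of the configuration database from the previous sketch. Here generate_code, run_matmul, define_search_space, and autotune_over are hypothetical helper names standing in for the library's code generator, executor, search-space builder, and tuner; none of them are names taken from the patent.

```python
def dispatch(db, conv_size, A, B):
    entry = db.get(conv_size)                # S21: look up the convolution size
    if entry is not None:                    # S22: reuse the stored optimum
        return run_matmul(generate_code(entry["params"]), A, B)
    # S23: no stored optimum -- build a parameter search space from the
    # convolution size (M, N, K); hardware limits would prune it further
    space = define_search_space(*conv_size)
    return autotune_over(space, db, conv_size, A, B)
```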
The configuration parameters corresponding to the convolution size include at least the number of rows M of the first matrix A, the number of columns K of the first matrix A, the number of rows mc of the cache block of the first matrix A, the number of columns kc of the cache block of the first matrix A, the number of columns N of the second matrix B, the number of columns nc of the cache block of the second matrix B, the number of rows m_reg of the register block, the number of columns n_rreg of the register block, the prefetch size pre_a of the first matrix A, the prefetch size pre_b of the second matrix B, and the search-space tag loopReorder.
Specifically, the number of rows of the cache block of the first matrix A takes values in [8, max(M, 1024)], where M is the number of rows of the first matrix A; the number of columns of the cache block of the first matrix A takes values in [8, max(K, 1024)], where K is the number of columns of the first matrix A; the number of columns of the cache block of the second matrix B takes values in [8, max(N, 1024)], where N is the number of columns of the second matrix B; the number of rows m_reg of the register block is 4 or 8; the number of columns n_rreg of the register block is 8, 12, or 16; the prefetch sizes pre_a of the first matrix A and pre_b of the second matrix B each take one of the values 0, 32, 64, 128, 256, or 512; and the search-space tag loopReorder takes one of the values 0, 1, 2, or 3. See Table 2 for details.
Table 2. Runtime parameters and search space of the reconfigurable matrix multiplication library

| Runtime parameter | Definition | Range |
|---|---|---|
| M | No. of rows of matrix A | — |
| N | No. of cols of matrix B | — |
| K | No. of cols of matrix A | — |
| mc | No. of rows of cache block of matrix A | [8, max(M, 1024)] |
| nc | No. of cols of cache block of matrix B | [8, max(N, 1024)] |
| kc | No. of cols of cache block of matrix A | [8, max(K, 1024)] |
| m_reg | No. of rows of register block | 4, 8 |
| n_rreg | No. of cols of register block | 8, 12, 16 |
| pre_a | Prefetch size of matrix A | 0, 32, 64, 128, 256, 512 |
| pre_b | Prefetch size of matrix B | 0, 32, 64, 128, 256, 512 |
| loopReorder | Loop reorder tag | 0, 1, 2, 3 |
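Table 2 transcribes almost directly into a Python search-space definition. The sketch below assumes the cache-block ranges are sampled in steps of 8, which the patent does not specify; the helper name and dictionary layout are likewise illustrative.

```python
def define_search_space(M, N, K):
    block_range = lambda dim: list(range(8, max(dim, 1024) + 1, 8))  # step of 8 is an assumption
    return {
        "mc": block_range(M),                  # rows of cache block of A
        "kc": block_range(K),                  # cols of cache block of A
        "nc": block_range(N),                  # cols of cache block of B
        "m_reg": [4, 8],                       # rows of register block
        "n_rreg": [8, 12, 16],                 # cols of register block
        "pre_a": [0, 32, 64, 128, 256, 512],   # prefetch size of A
        "pre_b": [0, 32, 64, 128, 256, 512],   # prefetch size of B
        "loopReorder": [0, 1, 2, 3],           # loop reorder tag
    }
```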
Further, if the configuration database contains no configuration parameter corresponding to the convolution size, a parameter search space corresponding to the configuration parameters is defined from the convolution size and the hardware parameters. Referring to FIG. 3, which is a schematic flowchart of a specific embodiment of step S23 in FIG. 2, the following steps are included:
S31: Configure, according to the hardware parameters, multiple parameter combinations corresponding to the convolution size to obtain the configuration parameters.
Different hardware parameters yield different sets of parameter combinations for a given convolution size. To obtain configuration parameters that allow the subsequent matrix multiplication to run as well as possible, the multiple parameter combinations corresponding to the convolution size are configured according to the hardware parameters.
S32: Select one of the multiple parameter combinations.
To make the matrix multiplication faster, one combination may first be selected from the multiple parameter combinations for computing the matrix multiplication. Of course, several combinations may also be selected and, where resources permit (for example, when the computing space corresponding to the hardware parameters allows), the matrix multiplication may be computed under multiple parameter combinations at once; this can be set according to actual needs and is not limited here.
S33: Define the corresponding parameter search space based on the selected parameter combination.
Because each parameter combination can define a corresponding parameter search space, the corresponding search space can be defined based on the selected combination; if multiple combinations are selected, the corresponding multiple parameter search spaces can be defined at the same time.
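Enumerating the combinations of S31–S33 amounts to a Cartesian product over the ranges above. The optional pruning by a cache budget in this sketch is an assumed refinement of "where the computing space allows", not a rule stated in the patent.

```python
import itertools

def parameter_combinations(space, cache_bytes=None):
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        combo = dict(zip(keys, values))
        # Skip combinations whose A cache block cannot fit in the cache
        # budget (4 bytes per float32 element) -- an illustrative filter.
        if cache_bytes is not None and 4 * combo["mc"] * combo["kc"] > cache_bytes:
            continue
        yield combo
```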
Further, multiple operation codes are generated in the parameter search space, and the matrix multiplication is computed with those codes to obtain multiple operation results. Referring to FIG. 4, which is a schematic flowchart of a specific embodiment of step S14 in FIG. 1, the following steps are included:
S41: Based on the selected parameter combinations, generate, in the parameter search space, multiple operation codes corresponding to the convolution.
In the parameter search space, each parameter combination can generate one operation code corresponding to the convolution, so multiple parameter combinations can generate multiple operation codes for computing the same matrix multiplication.
S42: Compute the matrix multiplication with the multiple operation codes to obtain the I-th operation result, where I is a positive integer greater than 1 and less than or equal to the number of operation codes.
Computing the matrix multiplication with one operation code yields one operation result; computing it with multiple operation codes yields multiple operation results, denoted by the I-th operation result, where I is a positive integer greater than 1 and less than or equal to the number of operation codes.
Specifically, the multiple operation codes may be ordered in a stack and used to compute the matrix multiplication one by one, yielding the operation results in turn; alternatively, multiple operation codes may be executed at the same time to obtain multiple operation results simultaneously.
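One way to realize S41–S42 is to generate one code per combination and time each run, as in the sketch below; generate_code and run_matmul remain the hypothetical helpers introduced earlier.

```python
import time

def benchmark_all(combos, A, B):
    results = []
    for combo in combos:
        code = generate_code(combo)        # S41: one operation code per combination
        start = time.perf_counter()
        run_matmul(code, A, B)             # S42: same matrix multiplication each time
        elapsed = time.perf_counter() - start
        results.append((combo, elapsed))   # the I-th operation result
    return results
```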
Further, the configuration parameters of the operation code corresponding to the one of the multiple operation results that satisfies the preset condition are stored in the configuration database. Referring to FIG. 5, which is a schematic flowchart of a specific embodiment of step S15 in FIG. 1, the following steps are included:
S51: Determine whether the first operation result and/or the I-th operation result satisfies the preset condition, the preset condition at least including that the computation time of the matrix multiplication is the shortest among the multiple operation results.
A preset condition is set in order to screen the I-th operation result, and it can be defined as needed. Here it at least includes that the computation time of the matrix multiplication is the shortest among the multiple operation results. That is, the I-th operation result is the computation time obtained by executing the corresponding operation code on the matrix multiplication, and the preset condition is to find the shortest among the multiple computation times, thereby selecting among the operation results.
If the first operation result and/or the I-th operation result satisfies the preset condition, the method proceeds to step S52: store the configuration parameters corresponding to the first operation result and/or the I-th operation result in the configuration database. If the first operation result and/or the I-th operation result does not satisfy the preset condition, the method proceeds to step S53: discard the configuration parameters corresponding to the I-th operation result and store the configuration parameters corresponding to the first operation result in the configuration database.
Specifically, during screening, the operation results may also be selected by successive comparison. For example, if the first operation result indicates a computation time of 10 seconds, the second 3 seconds, and the third 5 seconds, then 3 seconds is shorter than both 10 seconds and 5 seconds, so the configuration parameters corresponding to the second operation result are optimal. Of course, many comparison schemes are possible and can be chosen according to specific needs; no limitation is imposed here. Typically, after the matrix multiplication is reconstructed and self-optimized, the performance of the reconfigurable matrix multiplication library corresponding to the configuration database improves by 2% to 17% on different convolutional layers.
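The comparison in the example above reduces to taking the minimum run time, as this self-contained snippet shows:

```python
# The three example results from the text: 10 s, 3 s, and 5 s.
results = [({"id": 1}, 10.0), ({"id": 2}, 3.0), ({"id": 3}, 5.0)]
best_params, best_time = min(results, key=lambda r: r[1])
assert best_params["id"] == 2 and best_time == 3.0  # the second result wins
```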
For a better understanding of the scheme of the present application, the convolution processing method is described below by taking, as a specific embodiment, a convolution operation on a CPU with the ARM architecture. In this embodiment, the terms "convolution operation" and "convolution calculation" are used interchangeably; they mean the same thing. The purpose of this embodiment is to propose an acceleration library for convolution calculation that automatically generates optimized convolution code, exploring the best performance on the latest ARM-based hardware architectures from different vendors. Because a convolution calculation can be converted into a matrix multiplication by the Im2col algorithm, the library is optimized mainly for matrix multiplication: the matrix multiplication is integrated into a parameterized, reconfigurable library used to search for the best combination of runtime parameters, including the register-kernel shape, the cache block sizes, and the scheduling strategy, for any given convolution shape and hardware target.
The method of the present application includes the following design features:
(1) Because a convolution calculation can be converted into a matrix multiplication by the Im2col algorithm, the present application mainly optimizes the matrix multiplication.
(2) A reconfigurable matrix multiplication library. Since the convolution calculation is converted into a matrix multiplication by the Im2col algorithm, the present application designs for the matrix multiplication a reconfigurable library with a multi-level code cache hierarchy. It is used to search and reproduce all possible code structures, including the register-kernel shape, the cache block sizes, and the various combinations of loop-order scheduling, reordering strategies, memory access patterns, and online/offline computation. This reconfigurable matrix multiplication library reduces the workload of manual tuning.
(3) An automatic optimization method based on the configurable algorithm library. After the generated micro-kernels are embedded into the reconfigurable matrix multiplication library, a convolution calculation library can be built by combining it with the Im2col algorithm. It can use an auto-tuning strategy to search all parameter configurations for the best performance under a given hardware specification and convolution problem size. The optimal parameter configuration can be stored and reused. This convolution calculation library can be embedded into existing deep learning framework software stacks, and it requires no manual optimization for the various combinations of hardware specifications and convolution shapes.
Referring to FIG. 6 and FIG. 7, the design of the present application is described below. FIG. 6 is a schematic diagram of the matrix framework in a specific embodiment of the convolution processing method of the present application, and FIG. 7 is a schematic diagram of the block structure of the matrices of the present application.
1. Converting the convolution calculation into a matrix multiplication with the Im2col algorithm
Similar to matrix multiplication, the convolution calculation multiplies each convolution kernel tensor element-wise with a block of the input image and then accumulates the results over all input channels. Owing to its high ratio of computation to memory access, matrix multiplication is well optimized for large square matrices. The steps of the Im2col algorithm are shown in FIG. 6. It first reshapes the convolution kernel tensor F into a matrix of size K×CRS and then copies the original input image into a matrix of size CRS×HW. These two steps convert the convolution calculation into a matrix multiplication, and an output matrix of size K×EF is then obtained with a single matrix multiplication. The matrices that the Im2col algorithm generates in deep learning therefore come in many shapes, such as long, thin strip matrices; a fixed micro-kernel, data layout, and data-processing schedule may not provide the best solution for matrix operations of different shapes.
Specifically, F_m may be a rearrangement of the convolution kernels; on the right, D is unrolled according to the square shape of F and then stacked column by column from bottom to top, red to red, green to green, blue to blue, to obtain O_m.
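For illustration, a minimal NumPy sketch of the Im2col transformation follows; it assumes stride 1 and no padding, which the passage above does not state, and uses F_out for the output width to avoid clashing with the kernel tensor F.

```python
import numpy as np

def im2col_conv(F, D):
    """F: kernel tensor (K, C, R, S); D: input image (C, H, W)."""
    K, C, R, S = F.shape
    _, H, W = D.shape
    E, F_out = H - R + 1, W - S + 1
    Fm = F.reshape(K, C * R * S)                      # kernels as a K x CRS matrix
    cols = np.empty((C * R * S, E * F_out), D.dtype)  # input as a CRS x EF matrix
    idx = 0
    for e in range(E):
        for f in range(F_out):
            cols[:, idx] = D[:, e:e + R, f:f + S].reshape(-1)  # one receptive field
            idx += 1
    return (Fm @ cols).reshape(K, E, F_out)           # one matmul gives the K x EF output
```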
2. Building the reconfigurable matrix multiplication library framework and automatic optimization
First, a reconfigurable matrix multiplication framework needs to be built. There are various high-performance implementations of matrix multiplication. According to analyses in the literature, such as Goto's paper, in the row-major case one method, named "GEPB", performs significantly better than the others. The "GEPB" method is therefore chosen to implement the reconfigurable matrix multiplication framework. Its steps, illustrated in the accompanying figure and sketched in code after this list, are as follows:
(1) First, matrix A is divided by columns into several column matrix blocks, and matrix B is divided by rows into several row matrix blocks.
(2) Next, the memory of one column matrix block of matrix A is rearranged, and each row matrix block of matrix B is further divided by columns into several column matrix blocks.
(3) One column matrix block of matrix A is further divided by rows into several row matrix blocks. The memory of the small blocks of matrix B is then rearranged.
(4) One column matrix block of matrix A is computed against a small block of matrix B, and the result is written back into matrix C.
(5) The above steps are repeated until the computation is complete.
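A NumPy sketch of this blocking scheme is given below. The packing of steps (2) and (3) is represented only by np.ascontiguousarray, and the default block sizes are illustrative; a real implementation would pack into micro-panel layouts and call a register-blocked micro-kernel instead of the @ operator.

```python
import numpy as np

def gepb_matmul(A, B, mc=64, kc=64, nc=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), A.dtype)
    for k0 in range(0, K, kc):                  # (1) column panel of A / row panel of B
        Ap = np.ascontiguousarray(A[:, k0:k0 + kc])               # (2) repack the A panel
        for n0 in range(0, N, nc):              # (2) column blocks of the B panel
            Bp = np.ascontiguousarray(B[k0:k0 + kc, n0:n0 + nc])  # (3) repack the B block
            for m0 in range(0, M, mc):          # (3) row blocks of the A panel
                C[m0:m0 + mc, n0:n0 + nc] += Ap[m0:m0 + mc] @ Bp  # (4) accumulate into C
    return C                                    # (5) the loops repeat until done
```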
However, the original "GEPB" method is not optimized for different hardware platforms, so optimization strategies that can adapt to different platforms must be added to it, forming a reconfigurable matrix multiplication framework. The TLB and cache characteristics of the CPU affect the cache block sizes, the register block sizes, and the data prefetch distances in the matrix multiplication, which strongly affect its performance, especially because after the Im2col algorithm the matrices in deep learning are often long strip matrices. However, the specific TLB and cache information differs across hardware configurations, and under different convolution shapes it is difficult to quantify matrix multiplication performance from TLB and cache information alone. Therefore, several cache- and TLB-related runtime parameters are extracted from the matrix multiplication implementation, as shown in Table 2. To control which blocks of the matrices are kept in which caches (L1, L2, L3, and so on), loop reordering is also used to control the memory access pattern of the matrix multiplication subroutine. Loop reordering over the cache blocks determines whether the blocks of matrix A are scanned once and kept in the L1 cache while the blocks of matrix B are scanned multiple times and kept in the L2 cache, or whether the blocks of matrix B are scanned once and kept in the L1 cache while the blocks of matrix A are scanned multiple times and kept in the L2 cache. With these runtime parameters, a reconfigurable matrix multiplication library can be built; the automatic optimization process is then carried out on this matrix multiplication framework.
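How the loopReorder tag maps to concrete loop orders is not fixed by the text; the dictionary below is therefore only one plausible encoding, shown to make the idea concrete.

```python
# Hypothetical mapping of loopReorder tags to cache-block loop orders;
# which matrix stays resident in L1 follows from the innermost loops.
LOOP_ORDERS = {
    0: ("k", "n", "m"),  # blocks of B resident in L1, A panel rescanned
    1: ("k", "m", "n"),  # blocks of A resident in L1, B panel rescanned
    2: ("n", "k", "m"),
    3: ("m", "k", "n"),
}
```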
Referring to FIG. 8, which is a schematic flowchart of a specific embodiment of the convolution processing method of the present application, the whole flow of convolution calculation and automatic optimization is shown in FIG. 8. The automatic optimization proceeds as follows:
S801: Start.
S802: Execute the Im2col algorithm to convert the convolution operation into a matrix multiplication.
S803: Determine whether optimal parameters exist in the local database.
The local configuration database is queried first to check whether optimal configuration parameters already exist for the current hardware configuration and convolution size; such parameters may be the best result found the last time the matrix multiplication was run under the same hardware configuration and convolution size. If an optimal parameter configuration already exists, the method proceeds to step S804: generate code from the optimal parameter combination, and then to step S805: compute the matrix multiplication with that operation code to obtain the operation result. If none exists, the method proceeds to step S806.
S806: Select a parameter combination from the parameter search space.
In the parameter search space, with several kinds of configuration parameters there can be many parameter combinations, but each combination yields only one operation result; to simplify the subsequent comparison of operation results, one parameter combination is selected at a time.
S807: Generate the operation code according to the parameter combination.
When no optimal parameter configuration exists, the specific hardware parameters and convolution size define a parameter configuration search space, and the operation code used to execute the algorithm is generated from the selected parameter combination.
S808: Compute the matrix multiplication with the operation code to obtain an operation result.
S809: Determine whether this operation result is better than the previous best result.
If the operation result is the first one, there is no previous best result and no comparison is needed; otherwise the result is compared with the previous best. If it is better, the method proceeds to step S810: update the configuration database. The search continues over the entire parameter configuration space, and whenever a result beats the current best performance the optimal parameter configuration is updated, until the whole space has been searched. The method then proceeds to step S811: determine whether the traversal is complete; if it is, the method proceeds to step S812: tuning ends and the optimal code is saved.
S813: End.
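The following sketch ties the S801–S813 flow together using the hypothetical helpers from the earlier sketches (im2col_to_matrices, define_search_space, parameter_combinations, generate_code, run_matmul, time_run); it is an outline of the flowchart, not the patent's implementation.

```python
def convolve_with_autotuning(db, conv):
    A, B, conv_size = im2col_to_matrices(conv)          # S802
    entry = db.get(conv_size)                           # S803
    if entry is not None:
        return run_matmul(generate_code(entry["params"]), A, B)  # S804-S805
    best = None
    space = define_search_space(*conv_size)
    for combo in parameter_combinations(space):         # S806 / S811: full traversal
        code = generate_code(combo)                     # S807
        t = time_run(code, A, B)                        # S808
        if best is None or t < best[1]:                 # S809
            best = (combo, t)
            db[conv_size] = {"params": combo, "time": t}  # S810
    return run_matmul(generate_code(best[0]), A, B)     # S812: reuse the optimal code
```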
The present application builds a reconfigurable matrix multiplication framework that adapts to ARM-architecture CPU computing platforms with different characteristics and extracts several key performance-related runtime parameters from an ordinary matrix multiplication, so the library can generate well-performing matrix multiplication code for different hardware architectures and convolution shapes and thus complete convolution calculations efficiently.
Using automatic optimization within the reconfigurable matrix multiplication framework, the optimal parameter combination for the current hardware architecture and convolution shape is found among all parameter combinations, which addresses the poor performance of long-strip matrix multiplication and yields a matrix multiplication of sufficiently good performance for the convolution calculation.
To verify that the scheme is effective and feasible, performance tests were carried out on Huawei's server-class Kunpeng 920 chip, noting that such a server platform generally outperforms a mobile phone. The hardware configuration of the Kunpeng 920 is shown in Table 3 below:
Table 3. Hardware configuration of the Kunpeng 920

| Processor | #CPUs @ clock speed | Memory | Compiler |
|---|---|---|---|
| Kunpeng 920 | [email protected] (64 KB L1, 512 KB L2, 32 MB shared L3) | 16 GB DDR4 | GCC 9.3.0 |
The deep learning network model selected for testing is VGG-16. Because the performance of the Im2col step of the convolution calculation is the same as in other existing implementations, the matrix multiplication part of the convolution calculation is extracted directly for performance testing and comparison. Our matrix multiplication implementation, FastConv-GEMM, is compared with AutoTVM, AutoTVM+LIBXSMM, AutoTVM+OpenBLAS, and OpenBLAS. The implementation of the present application is divided into two versions: "FastConv without autotuning", the result of testing the reconfigurable matrix multiplication library directly with the default parameter configuration and no automatic optimization; and "FastConv with autotuning", the result of first running the automatic optimization to find the optimal parameter configuration for the current hardware architecture and convolution size and then testing with that configuration.
Referring to FIG. 9, which is a schematic diagram of the results of the convolution processing method of the present application, FIG. 9 shows the speedup on the Kunpeng 920 relative to AutoTVM, where AutoTVM, generating its kernels with its own Halide on the Kunpeng 920, serves as the baseline for all six competitors. Compared with tuning OpenBLAS and LIBXSMM through AutoTVM, the OpenBLAS library itself performs better, which indicates that AutoTVM's machine-learning-based scheduling cannot improve the efficiency of existing high-performance matrix libraries on the Kunpeng 920. Inspired by AutoTVM, however, our reconfigurable matrix multiplication library with the default parameter configuration ranks second among all methods, behind only the auto-tuned version of the same library. After automatic optimization, the performance of the reconfigurable matrix multiplication library improves by roughly 2% to 17% on different convolutional layers; the "Speedup over AutoTVM" curve in FIG. 9 shows the ratio of the optimized to the unoptimized version of this scheme. In the VGG-16 model, the matrices generated by the convolution calculations of the middle layers are mostly square, while those of the first and last layers are mostly long strip matrices. It can therefore be concluded from the figure that, after automatic optimization, our reconfigurable matrix multiplication library significantly improves the matrix multiplication performance of long strip matrices without harming that of square matrices.
Further, referring to FIG. 10, which is a schematic block diagram of an embodiment of a mobile terminal of the present application, an embodiment of the present application provides a mobile terminal 2 including a processor 21 and a memory 22. The memory 22 stores a computer program 221, and the processor 21 is configured to execute the computer program 221 to implement the processing method of the first aspect of the embodiments of the present application, which is not repeated here.
Referring to FIG. 11, which is a schematic block diagram of an embodiment of a computer-readable storage medium of the present application: if implemented in the form of a software functional unit and sold or used as an independent product, the method can be stored in a computer-readable storage medium 30. Based on this understanding, the technical solution of the present application in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage device and includes several instructions (computer program 31) for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage device includes various media such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, as well as electronic devices equipped with such storage media, such as computers, mobile phones, notebook computers, tablet computers, and cameras.
For the execution process of the computer program in the computer-readable storage medium, reference may be made to the description in the embodiments of the processing method of the mobile terminal 2 of the present application, which is not repeated here.
The above are only some embodiments of the present application and do not limit its scope of protection. Any equivalent device or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (10)

  1. A method for processing a convolution operation, wherein the method comprises:
    obtaining a convolution operation to be processed and a configuration database;
    converting the convolution operation into a matrix multiplication, the matrix multiplication corresponding to a convolution size;
    if it is determined that the configuration database contains no configuration parameter corresponding to the convolution size, defining a parameter search space according to the convolution size and hardware parameters;
    generating a plurality of operation codes according to configuration parameters in the parameter search space, and computing the matrix multiplication with the plurality of operation codes to obtain a plurality of operation results; and
    storing, in the configuration database, the configuration parameters of the operation code corresponding to the one of the plurality of operation results that satisfies a preset condition.
  2. The processing method according to claim 1, wherein
    said defining a parameter search space according to the convolution size and hardware parameters, if it is determined that the configuration database contains no configuration parameter corresponding to the convolution size, comprises:
    determining whether the configuration database contains a configuration parameter corresponding to the convolution size;
    if the configuration database contains a configuration parameter corresponding to the convolution size, generating an operation code according to the configuration parameter and computing it to obtain an operation result; and
    if the configuration database contains no configuration parameter corresponding to the convolution size, defining a parameter search space corresponding to the configuration parameters according to the convolution size and the hardware parameters.
  3. The processing method according to claim 2, wherein
    the configuration parameters corresponding to the convolution size comprise at least a number of rows of a first matrix, a number of columns of the first matrix, a number of rows of a cache block of the first matrix, a number of columns of the cache block of the first matrix, a number of columns of a second matrix, a number of columns of a cache block of the second matrix, a number of rows of a register block, a number of columns of the register block, a prefetch size of the first matrix, a prefetch size of the second matrix, and a search-space tag;
    wherein the number of rows of the cache block of the first matrix takes values in [8, max(M, 1024)], where M is the number of rows of the first matrix; the number of columns of the cache block of the first matrix takes values in [8, max(K, 1024)], where K is the number of columns of the first matrix; the number of columns of the cache block of the second matrix takes values in [8, max(N, 1024)], where N is the number of columns of the second matrix; the number of rows of the register block is 4 or 8; the number of columns of the register block is 8, 12, or 16; the prefetch size of the first matrix and the prefetch size of the second matrix each comprise at least one of 0, 32, 64, 128, 256, or 512; and the search-space tag takes at least one of the values 0, 1, 2, or 3.
  4. The processing method according to claim 3, wherein
    said defining a parameter search space corresponding to the configuration parameters according to the convolution size and the hardware parameters, if the configuration database contains no configuration parameter corresponding to the convolution size, comprises:
    configuring, according to the hardware parameters, a plurality of parameter combinations corresponding to the convolution size to obtain the configuration parameters;
    selecting one of the plurality of parameter combinations; and
    defining the corresponding parameter search space based on the selected parameter combination.
  5. The processing method according to claim 4, wherein
    said generating a plurality of operation codes in the parameter search space and computing the matrix multiplication with the plurality of operation codes to obtain a plurality of operation results comprises:
    generating, in the parameter search space and based on the selected parameter combinations, a plurality of operation codes corresponding to the convolution; and
    computing the matrix multiplication with the plurality of operation codes to obtain a first operation result and an I-th operation result, where I is a positive integer greater than 1 and less than or equal to the number of the operation codes.
  6. The processing method according to claim 5, wherein
    said storing, in the configuration database, the configuration parameters of the operation code corresponding to the one of the plurality of operation results that satisfies a preset condition comprises:
    determining whether the first operation result and/or the I-th operation result satisfies the preset condition, the preset condition at least comprising that the computation time of the matrix multiplication is the shortest among the plurality of operation results;
    if the first operation result and/or the I-th operation result satisfies the preset condition, storing the configuration parameters corresponding to the first operation result and/or the I-th operation result in the configuration database; and
    if the first operation result and/or the I-th operation result does not satisfy the preset condition, discarding the configuration parameters corresponding to the I-th operation result and storing the configuration parameters corresponding to the first operation result in the configuration database.
  7. The processing method according to any one of claims 1 to 6, wherein
    said converting the convolution operation into a matrix multiplication, the matrix multiplication corresponding to a convolution size, comprises:
    performing the Im2col algorithm on the convolution to convert the convolution operation into the matrix multiplication calculation corresponding to the convolution, the matrix multiplication corresponding to a convolution size.
  8. The processing method according to any one of claims 1 to 6, wherein
    the performance of a reconfigurable matrix multiplication calculation library corresponding to the configuration database is improved by 2% to 17% on different convolutional layers.
  9. A mobile terminal, comprising a processor and a memory, wherein the memory stores a computer program and the processor is configured to execute the computer program to implement the processing method according to any one of claims 1 to 8.
  10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements the processing method according to any one of claims 1 to 8.
PCT/CN2021/094948 2021-05-20 2021-05-20 Convolution operation processing method, and electronic device, mobile terminal and storage medium WO2022241725A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/094948 WO2022241725A1 (en) 2021-05-20 2021-05-20 Convolution operation processing method, and electronic device, mobile terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2022241725A1 true WO2022241725A1 (en) 2022-11-24

Family

ID=84140115

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/094948 WO2022241725A1 (en) 2021-05-20 2021-05-20 Convolution operation processing method, and electronic device, mobile terminal and storage medium

Country Status (1)

Country Link
WO (1) WO2022241725A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108320019A (en) * 2018-02-06 2018-07-24 澎峰(北京)科技有限公司 Convolutional calculation method and device for depth convolutional neural networks
CN109460533A (en) * 2017-09-06 2019-03-12 华为技术有限公司 A kind of method and device improving GEMM calculated performance
CN109726357A (en) * 2017-10-27 2019-05-07 阿里巴巴集团控股有限公司 Matrix multiplication calculation method and calculating equipment
US20190179869A1 (en) * 2017-12-12 2019-06-13 Facebook, Inc. Hardware accelerator pre-configured with coefficients for matrix-transform operations
CN110929860A (en) * 2019-11-07 2020-03-27 深圳云天励飞技术有限公司 Convolution acceleration operation method and device, storage medium and terminal equipment


Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21940179; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21940179; Country of ref document: EP; Kind code of ref document: A1 |