WO2021088964A1 - Inference system, inference method, electronic device and computer storage medium - Google Patents

Inference system, inference method, electronic device and computer storage medium

Info

Publication number
WO2021088964A1
WO2021088964A1 (PCT Application No. PCT/CN2020/127026; CN2020127026W)
Authority
WO
WIPO (PCT)
Prior art keywords
inference
reasoning
computing device
calculation model
information
Application number
PCT/CN2020/127026
Other languages
French (fr)
Chinese (zh)
Inventor
林立翔
李鹏
游亮
龙欣
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2021088964A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G06N5/045 - Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence

Definitions

  • Fig. 7 is a flowchart of an inference method according to the sixth embodiment of the present invention.
  • Fig. 9 is a schematic structural diagram of an electronic device according to the eighth embodiment of the present invention.
  • the corresponding model information can be obtained when the deep learning framework loads the model.
  • the inference client first sends the information of the calculation model to the second terminal device, and the second terminal device receives the information of the calculation model through the inference server.
  • Assuming that the information of the calculation model indicates that the calculation model to be used is calculation model A, and the resource pool of the second terminal device stores calculation models A, B, C, and D, the second terminal device will load calculation model A directly from the resource pool through the GPU.
  • one or more types of inference acceleration resources are provided in the second computing device 204; when the inference acceleration resources include multiple types, different types of inference acceleration resources have different usage priorities; the inference server 2042 uses the inference acceleration resources according to the preset load balancing rules and the priorities of the various types of inference acceleration resources.
  • the number of inference acceleration resources of a certain type may be one or more, which is set by a person skilled in the art according to requirements, and the embodiment of the present invention does not limit this.
  • those skilled in the art can also set other appropriate load balancing rules according to actual needs, which is not limited in the embodiment of the present invention.
  • the inference calculation can be seamlessly transferred to the remote target computing device with inference acceleration resources, and the interaction between the current computing device and the target computing device is imperceptible to the user. Therefore, it can be ensured that the business logic of the application involving inference and the user's usage habits for the inference business remain unchanged, inference is realized at low cost, and the user experience is improved.
  • Step S506: Feed back the result of the inference processing to the source computing device.
  • the inference method of this embodiment can be implemented by the inference server of the second computing device in the foregoing embodiment, and the specific implementation of the foregoing process can also refer to the operation of the inference server in the foregoing embodiment, which will not be repeated here.
  • the inference acceleration resource includes one or more types; when the inference acceleration resource includes multiple types, different types of inference acceleration resources have different usage priorities; then,
  • loading, by the inference acceleration resource, the calculation model indicated by the model information includes: using the inference acceleration resource to load the calculation model indicated by the model information according to a preset load balancing rule and the priorities of the multiple types of inference acceleration resources.
  • the electronic device may include a processor 702, a communication interface 704, a memory 706, and a communication bus 708.
  • the information of the processing function is API interface information of the processing function.
  • the inference processing is deployed in different computing devices, where the target computing device is provided with inference acceleration resources and can perform the main inference processing through the calculation model, while the current electronic device, which executes the inference method of this embodiment, can be responsible for data processing before and after the inference processing.
  • the current electronic device can first send the model information of the calculation model to the target computing device, and the target computing device uses the inference acceleration resource to load the corresponding calculation model; then, the current electronic device sends the data to be inferred to the target computing device, and after the target computing device receives the data to be inferred, it can perform inference processing through the loaded calculation model. In this way, the decoupling of computing resources used for inference is realized.
  • the processor 802 is configured to execute the program 810, and specifically can execute the relevant steps in the inference method embodiment in the fifth or sixth embodiment.
  • each component/step described in the embodiments of the present invention can be split into more components/steps, or two or more components/steps or partial operations of components/steps can be combined into new components/steps to achieve the purpose of the embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Provided are an inference system, an inference method, an electronic device and a computer storage medium. The inference system comprises a first computing device and a second computing device that are connected to each other, with the first computing device being provided with an inference client, and the second computing device comprising an inference acceleration resource and an inference server, wherein the inference client is used for acquiring model information of a computing model for inference and data to be inferred, and for respectively sending the model information and said data to the inference server in the second computing device; and the inference server is used for loading and calling, by means of the inference acceleration resource, the computing model indicated by the model information, and performing, by means of the computing model, inference processing on said data, and feeding back the result of the inference processing to the inference client.

Description

Inference system, inference method, electronic device and computer storage medium
This application claims priority to Chinese Patent Application No. 201911089253.X, filed on November 8, 2019 and entitled "Inference system, inference method, electronic device and computer storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present invention relate to the field of computer technology, and in particular to an inference system, an inference method, an electronic device, and a computer storage medium.
Background
Deep learning is generally divided into two parts: training and inference. The training part searches for and solves the optimal parameters of a model, while the inference part deploys the trained model in an online environment for actual use. Taking the field of artificial intelligence as an example, once inference is deployed, input can be converted into a specific target output through neural network computation, for example object detection on images or classification of text content, which is widely used in scenarios such as vision, speech, and recommendation.
At present, most inference relies on hardware computing resources equipped with an inference accelerator card such as a GPU (Graphics Processing Unit). For example, in artificial intelligence inference, one approach is to connect the GPU to the host computer through a PCIE (Peripheral Component Interconnect Express) slot. The pre- and post-processing and other business logic involved in inference are computed by the CPU, while the inference computation itself is sent to the GPU through the PCIE slot, forming a typical heterogeneous computing scenario. For example, the electronic device 100 shown in Fig. 1 is provided with both a CPU 102 and a GPU 104; the GPU 104 can be installed on the motherboard 108 of the electronic device through a PCIE slot 106 and interacts with the CPU 102 through the motherboard circuitry. In an inference process, the CPU 102 first processes the relevant data or information and then sends the processed data or information to the GPU 104 through the PCIE slot 106; the GPU 104 performs inference processing using the calculation model in the GPU 104 according to the received data or information, and then returns the inference result to the CPU 102, which performs the corresponding subsequent processing.
However, the above approach has the following problem: it requires a heterogeneous computing machine in which the CPU and GPU reside on the same machine, and the CPU/GPU specifications of such a machine are fixed. This fixed CPU/GPU performance ratio limits the deployment of applications involving inference, so that the needs of a wide range of inference scenarios cannot be met.
Summary of the Invention
In view of this, the embodiments of the present invention provide an inference solution to solve some or all of the above problems.
According to a first aspect of the embodiments of the present invention, an inference system is provided, including a first computing device and a second computing device connected to each other, where the first computing device is provided with an inference client, and the second computing device is provided with an inference acceleration resource and an inference server. The inference client is used to obtain model information of a calculation model for inference and data to be inferred, and to send the model information and the data to be inferred to the inference server in the second computing device. The inference server is used to load and call, by means of the inference acceleration resource, the calculation model indicated by the model information, to perform inference processing on the data to be inferred through the calculation model, and to feed back the result of the inference processing to the inference client.
According to a second aspect of the embodiments of the present invention, an inference method is provided. The method includes: obtaining model information of a calculation model for inference, and sending the model information to a target computing device to instruct the target computing device to load, using the inference acceleration resource provided in the target computing device, the calculation model indicated by the model information; obtaining data to be inferred, and sending the data to be inferred to the target computing device to instruct the target computing device to call the loaded calculation model using the inference acceleration resource and perform inference processing on the data to be inferred through the calculation model; and receiving the result of the inference processing fed back by the target computing device.
According to a third aspect of the embodiments of the present invention, another inference method is provided. The method includes: obtaining model information, sent by a source computing device, of a calculation model used for inference, and loading, through an inference acceleration resource, the calculation model indicated by the model information; obtaining data to be inferred sent by the source computing device, calling the loaded calculation model using the inference acceleration resource, and performing inference processing on the data to be inferred through the calculation model; and feeding back the result of the inference processing to the source computing device.
According to a fourth aspect of the embodiments of the present invention, an electronic device is provided, including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the inference method described in the second aspect, or causes the processor to perform the operations corresponding to the inference method described in the third aspect.
According to a fifth aspect of the embodiments of the present invention, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the inference method described in the second aspect, or the inference method described in the third aspect, is implemented.
According to the inference solution provided by the embodiments of the present invention, inference processing is deployed on different first and second computing devices, where the second computing device is provided with an inference acceleration resource and can perform the main inference processing through the calculation model, while the first computing device can be responsible for data processing before and after the inference processing. In addition, an inference client is deployed in the first computing device and an inference server is deployed in the second computing device; during inference, the first computing device and the second computing device interact through the inference client and the inference server. The inference client can first send the model information of the calculation model to the inference server, and the inference server loads the corresponding calculation model using the inference acceleration resource; the inference client then sends the data to be inferred to the inference server, and after receiving the data to be inferred, the inference server can perform inference processing through the loaded calculation model. In this way, the computing resources used for inference are decoupled: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be configured with an inference acceleration resource such as a GPU. There is no need for a single electronic device to have both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines limit the deployment of applications involving inference and make it impossible to meet the needs of a wide range of inference scenarios.
In addition, for users, when an application involving inference is used, the inference computation can be seamlessly transferred, through the inference client and the inference server, to a remote device with inference acceleration resources, and the interaction between the inference client and the inference server is imperceptible to the user. Therefore, the business logic of the application involving inference and the user's habits for the inference business remain unchanged, inference is realized at low cost, and the user experience is improved.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in the embodiments of the present invention, and those of ordinary skill in the art can also obtain other drawings based on these drawings.
Fig. 1 is a schematic structural diagram of an electronic device with inference computing resources in the prior art;
Fig. 2a is a structural block diagram of an inference system according to the first embodiment of the present invention;
Fig. 2b is a schematic structural diagram of an example of an inference system according to an embodiment of the present invention;
Fig. 3a is a structural block diagram of an inference system according to the second embodiment of the present invention;
Fig. 3b is a schematic structural diagram of an example of an inference system according to an embodiment of the present invention;
Fig. 3c is a schematic diagram of the process of performing inference using the inference system shown in Fig. 3b;
Fig. 3d is a schematic diagram of the interaction for performing inference using the inference system shown in Fig. 3b;
Fig. 4 is a flowchart of an inference method according to the third embodiment of the present invention;
Fig. 5 is a flowchart of an inference method according to the fourth embodiment of the present invention;
Fig. 6 is a flowchart of an inference method according to the fifth embodiment of the present invention;
Fig. 7 is a flowchart of an inference method according to the sixth embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device according to the seventh embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to the eighth embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art shall fall within the protection scope of the embodiments of the present invention.
The specific implementation of the embodiments of the present invention is further described below in conjunction with the accompanying drawings.
Embodiment 1
Referring to Fig. 2a, a structural block diagram of an inference system according to the first embodiment of the present invention is shown.
The inference system of this embodiment includes a first computing device 202 and a second computing device 204 connected to each other, where the first computing device 202 is provided with an inference client 2022, and the second computing device 204 is provided with an inference server 2042 and an inference acceleration resource 2044.
The inference client 2022 is used to obtain the model information of the calculation model for inference and the data to be inferred, and to send the model information and the data to be inferred to the inference server 2042 in the second computing device 204; the inference server 2042 is used to load and call, through the inference acceleration resource 2044, the calculation model indicated by the model information, to perform inference processing on the data to be inferred through the calculation model, and to feed back the result of the inference processing to the inference client 2022.
In one feasible implementation, the inference client 2022 in the first computing device 202 first obtains the model information of the calculation model for inference and sends the model information to the inference server 2042 in the second computing device 204; the inference server 2042 in the second computing device 204 loads, through the inference acceleration resource 2044, the calculation model indicated by the model information; the inference client 2022 in the first computing device 202 then obtains the data to be inferred and sends it to the inference server 2042 in the second computing device 204; the inference server 2042 in the second computing device 204 calls the loaded calculation model using the inference acceleration resource 2044, performs inference processing on the data to be inferred through the calculation model, and feeds back the result of the inference processing to the inference client 2022.
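The two-step exchange described above (send the model information first, then the data to be inferred, then receive the result) can be sketched as follows. This is a minimal illustration only: the length-prefixed message format, the field names, and the helper functions are assumptions made for the sketch and are not specified by this application.

```python
# Illustrative sketch of the inference client's interaction with the remote
# inference server; the message format and helpers are assumptions.
import json
import socket
import struct

def send_msg(sock: socket.socket, header: dict, payload: bytes = b"") -> None:
    """Send a length-prefixed message: 4-byte header length, JSON header, raw payload."""
    head = json.dumps({**header, "payload_len": len(payload)}).encode("utf-8")
    sock.sendall(struct.pack("!I", len(head)) + head + payload)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def recv_msg(sock: socket.socket) -> tuple[dict, bytes]:
    """Receive a message produced by send_msg."""
    head_len = struct.unpack("!I", _recv_exact(sock, 4))[0]
    header = json.loads(_recv_exact(sock, head_len))
    payload = _recv_exact(sock, header["payload_len"])
    return header, payload

def remote_infer(server_addr: tuple[str, int], model_info: str, data: bytes) -> bytes:
    """Send model information, then the data to be inferred; return the inference result."""
    with socket.create_connection(server_addr) as sock:
        # Step 1: the inference client sends the model information first, so the
        # inference server can load the model via its inference acceleration resource.
        send_msg(sock, {"type": "load_model", "model_info": model_info})
        ack, _ = recv_msg(sock)
        assert ack.get("status") == "loaded", ack
        # Step 2: the data to be inferred is sent, and the result is fed back.
        send_msg(sock, {"type": "infer", "model_info": model_info}, data)
        _result_header, result = recv_msg(sock)
        return result
```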
In the above inference system, because the second computing device 204 has the inference acceleration resource 2044 used for inference, it can effectively load the calculation model used for inference and perform inference computation on large amounts of data. Moreover, because the second computing device 204 where the inference acceleration resource 2044 is located and the first computing device 202 are set up independently of each other, the inference acceleration resource 2044, such as a GPU, does not need to follow a fixed specification pairing with the processor resource in the first computing device 202, such as a CPU, which makes the implementation of the inference acceleration resource 2044 more flexible and diverse. The inference acceleration resource 2044 can be implemented in various forms, including but not limited to a GPU or an NPU. Accordingly, the first computing device 202 only needs to be configured with resources for conventional data processing, such as a CPU.
The calculation model used for inference can be any appropriate calculation model set according to business requirements, as long as it is applicable to a deep learning framework (including but not limited to the Tensorflow framework, the Mxnet framework, and the PyTorch framework). In one feasible manner, the second computing device 204 can be pre-configured with a resource pool of calculation models; if the calculation model to be used is in the resource pool, it can be loaded and used directly, and if it is not in the resource pool, it can be obtained from the first computing device 202. In another feasible manner, the second computing device 204 may have no pre-configured resource pool of calculation models; when inference is needed, the required calculation model is obtained from the first computing device 202 and then stored locally. After multiple inferences, different calculation models can be accumulated and eventually stored as a resource pool of calculation models. The different calculation models obtained can come from different first computing devices 202; that is, the second computing device 204 can provide inference services for different first computing devices 202 and thereby obtain different calculation models from them.
The model information of the calculation model sent by the inference client 2022 to the inference server 2042 can uniquely identify the calculation model; for example, it can be identification information of the calculation model, such as an ID number. It is not limited to this, however: in one feasible manner, the model information of the calculation model can also be verification information of the calculation model, such as MD5 information. The verification information can identify the calculation model on the one hand and can also be used to verify the calculation model on the other, so that multiple functions are achieved through one piece of information and the cost of information processing is reduced. The model information can be obtained when the first computing device 202 loads the model.
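As an illustration of the checksum variant above, the model information could simply be the digest of the serialized model file, so that one value both identifies the model and allows the receiver to verify it. The helper below is a sketch under that assumption; its name and chunk size are not part of this application.

```python
# Illustrative: derive the model information from the model file itself (MD5 is
# the example given above), so the same value identifies and verifies the model.
import hashlib

def model_info_from_file(model_path: str) -> str:
    md5 = hashlib.md5()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            md5.update(chunk)
    return md5.hexdigest()
```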
The structure of the above inference system is illustrated below with a specific example, as shown in Fig. 2b.
In Fig. 2b, the first computing device 202 is implemented as a terminal device, namely the first terminal device, in which a CPU is provided for the corresponding business processing, and the second computing device 204 is also implemented as a terminal device, namely the second terminal device, in which a GPU is provided as the inference acceleration resource. In addition, the first computing device 202 is loaded with a deep learning framework and an inference client set up in the deep learning framework, and the second computing device 204 is correspondingly provided with an inference server. In this embodiment, it is also assumed that the second computing device 204 is provided with a resource pool of calculation models, in which multiple calculation models, such as calculation models A, B, C, and D, are stored.
Those skilled in the art should understand that the above example is merely illustrative. In practical applications, the first computing device 202 and the second computing device 204 may both be implemented as terminal devices, or may both be implemented as servers, or the first computing device 202 may be implemented as a server and the second computing device 204 as a terminal device, or vice versa, which is not limited in the embodiments of the present invention.
Based on the inference system of Fig. 2b, one process of using the inference system to perform inference is as follows.
Taking image recognition as an example, the corresponding model information can be obtained when the deep learning framework loads the model. The inference client first sends the information of the calculation model to the second terminal device, and the second terminal device receives the information of the calculation model through its inference server. Assuming that the information of the calculation model indicates that the calculation model to be used is calculation model A, and the resource pool of the second terminal device stores calculation models A, B, C, and D, the second terminal device will load calculation model A directly from the resource pool through the GPU. The second terminal device then obtains the data to be inferred, such as the image to be recognized, from the first terminal device through the inference server and the inference client, and calls calculation model A through the GPU to perform target object recognition on the image, for example to recognize whether there is a human figure in the image. After the recognition is completed, the second terminal device sends the recognition result through the inference server to the inference client of the first terminal device, and the inference client hands it over to the CPU for subsequent processing, such as adding AR special effects.
It should be noted that, in the embodiments of the present invention, unless otherwise specified, quantities related to "multiple", such as "multiple" and "various", mean two or more.
According to the inference system provided in this embodiment, inference processing is deployed on different first and second computing devices, where the second computing device is provided with an inference acceleration resource and can perform the main inference processing through the calculation model, while the first computing device can be responsible for data processing before and after the inference processing. In addition, an inference client is deployed in the first computing device and an inference server is deployed in the second computing device; during inference, the first computing device and the second computing device interact through the inference client and the inference server. The inference client can first send the model information of the calculation model to the inference server, and the inference server loads the corresponding calculation model using the inference acceleration resource; the inference client then sends the data to be inferred to the inference server, and after receiving the data to be inferred, the inference server can perform inference processing through the loaded calculation model. In this way, the computing resources used for inference are decoupled: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be configured with an inference acceleration resource such as a GPU. There is no need for a single electronic device to have both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines limit the deployment of applications involving inference and make it impossible to meet the needs of a wide range of inference scenarios.
In addition, for users, when an application involving inference is used, the inference computation can be seamlessly transferred, through the inference client and the inference server, to a remote device with inference acceleration resources, and the interaction between the inference client and the inference server is imperceptible to the user. Therefore, the business logic of the application involving inference and the user's habits for the inference business remain unchanged, inference is realized at low cost, and the user experience is improved.
Embodiment 2
This embodiment further optimizes the inference system of the first embodiment, as shown in Fig. 3a.
As described in the first embodiment, the inference system of this embodiment includes a first computing device 202 and a second computing device 204 connected to each other, where the first computing device 202 is provided with an inference client 2022, and the second computing device 204 is provided with an inference server 2042 and an inference acceleration resource 2044.
The inference client 2022 in the first computing device 202 is used to obtain the model information of the calculation model for inference and send the model information to the inference server 2042 in the second computing device 204; the inference server 2042 in the second computing device 204 is used to load, through the inference acceleration resource 2044, the calculation model indicated by the model information; the inference client 2022 in the first computing device 202 is further used to obtain the data to be inferred and send it to the inference server 2042 in the second computing device 204; the inference server 2042 in the second computing device 204 is further used to call the loaded calculation model using the inference acceleration resource 2044, perform inference processing on the data to be inferred through the calculation model, and feed back the result of the inference processing to the inference client 2022.
In one feasible manner, the first computing device 202 and the second computing device 204 are connected to each other through an elastic network. The elastic network includes, but is not limited to, an ENI (Elastic Network Interface) network. An elastic network has better scalability and flexibility, and connecting the first computing device 202 and the second computing device 204 through an elastic network gives the inference system better scalability and flexibility as well. This is not limiting, however: in practical applications, the first computing device 202 and the second computing device 204 can be connected in any appropriate manner or over any appropriate network, as long as data interaction between the two can proceed smoothly.
In this embodiment, optionally, the inference client 2022 can be implemented as a component embedded in the deep learning framework in the first computing device 202, or as a callable file that can be invoked by the deep learning framework. The deep learning framework provides a platform for deep learning; based on the deep learning framework, programmers can conveniently deploy various calculation models to realize different inference functions. Implementing the inference client 2022 in the form of a component or callable file applicable to a deep learning framework, on the one hand, gives it better compatibility and applicability and, on the other hand, greatly reduces the implementation cost of decoupling the inference computing resources. Similarly, the inference server 2042 can also be implemented in the form of a component or a callable file.
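One way to picture the "component embedded in the deep learning framework" is a thin wrapper that intercepts the framework's model-loading call and registers the model with the inference client. The sketch below is framework-agnostic and purely illustrative; the class, hook names, and the reuse of the checksum helper from the earlier sketch are assumptions, not the patent's interfaces.

```python
# Illustrative, framework-agnostic sketch: wrap a framework's model-loading
# function so that loading a model also registers it with the inference client.
import functools

class InferenceClient:
    def __init__(self, server_addr):
        self.server_addr = server_addr

    def register_model(self, model_path: str) -> str:
        """Derive the model information and announce it to the inference server."""
        model_info = model_info_from_file(model_path)  # e.g. the MD5 sketch above
        # ... send model_info to the inference server over the elastic network ...
        return model_info

def intercept_model_loading(framework_load_fn, client: InferenceClient):
    """Return a wrapped loader; the framework's own behaviour is left unchanged."""
    @functools.wraps(framework_load_fn)
    def wrapper(model_path, *args, **kwargs):
        client.register_model(model_path)                       # plugin sees every load
        return framework_load_fn(model_path, *args, **kwargs)   # original load proceeds
    return wrapper
```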
Based on the above structure, the inference system of this embodiment can conveniently exchange the corresponding data and information through the inference client 2022 and the inference server 2042, and realize inference processing by remotely invoking the inference acceleration resource 2044.
In addition, in this embodiment, the inference client 2022 is further used to send the calculation model to the inference server 2042 when it is determined that the calculation model does not exist in the second computing device 204.
Optionally, the model information of the calculation model is identification information or verification information of the calculation model; the inference server 2042 is further used to determine, through the identification information or the verification information, whether the calculation model exists in the second computing device 204, and to return the determination result to the inference client 2022. This is not limiting, however; other ways of determining whether the calculation model exists in the second computing device 204 are also applicable. For example, the second computing device 204 may broadcast the calculation models it holds at regular intervals; or, when needed or at regular intervals, the first computing device 202 may actively send a message to query the resource status of the calculation models in the second computing device 204, and so on.
For example, if no resource pool of calculation models is preset in the second computing device 204, or the required calculation model is not in the resource pool, the inference client 2022 will send the calculation model in the first computing device 202 to the second computing device 204, including but not limited to the structure of the calculation model and the data it contains. In the case where the information of the calculation model sent by the inference client 2022 to the inference server 2042 is identification information or verification information, the inference server 2042 will first determine, according to the received identification information or verification information, whether the required calculation model exists in the second computing device 204, and return the determination result to the inference client 2022. If the determination result indicates that the required calculation model does not exist in the second computing device 204, the first computing device 202 obtains the calculation model locally and sends it to the second computing device 204, so that the calculation model is run by the inference acceleration resource of the second computing device 204 to perform the inference processing.
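The "query first, upload only if missing" behaviour just described can be sketched as follows, reusing the hypothetical message helpers and checksum helper from the earlier sketches; the message types and field names are again assumptions for illustration.

```python
# Illustrative: the client asks whether the server already holds the model
# (identified by its checksum); only if not does it transmit the model itself.
def ensure_model_on_server(sock, model_path: str) -> str:
    model_info = model_info_from_file(model_path)
    send_msg(sock, {"type": "query_model", "model_info": model_info})
    reply, _ = recv_msg(sock)
    if not reply.get("exists", False):
        with open(model_path, "rb") as f:
            model_bytes = f.read()   # structure and data of the calculation model
        send_msg(sock, {"type": "upload_model", "model_info": model_info}, model_bytes)
        ack, _ = recv_msg(sock)
        assert ack.get("status") == "stored", ack
    return model_info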
In this way, it can be effectively ensured that the second computing device 204, which has the inference acceleration resource, can complete the inference processing smoothly.
In addition, in one feasible manner, the inference client 2022 is further used to obtain an inference request that asks the calculation model to perform inference processing on the data to be inferred, to perform semantic analysis on the inference request, to determine, according to the semantic analysis result, the processing function in the calculation model to be called, and to send the information of the processing function to the inference server 2042; when performing inference processing on the data to be inferred through the calculation model, the inference server 2042 does so by calling the processing function, indicated by the information of the processing function, in the loaded calculation model.
In some specific business applications of inference, the business may not need all of the inference functionality of the calculation model, but only part of it. For example, a certain inference task is used to classify text content, while the current business only needs its computation function to add the corresponding text vectors. In this case, after the inference client 2022 receives an inference request asking a certain calculation model to add text vectors, it determines, through semantic analysis of the inference request, that only the COMPUTE() function in the calculation model needs to be called, and it can then send the information of this function to the inference server 2042. After obtaining the information of this function, the inference server 2042 can directly call the COMPUTE() function in the calculation model to perform the addition of the text vectors.
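The per-request function selection can be pictured as mapping the parsed request onto a single entry point of the loaded model, so the server invokes only what the business actually needs. The request format, operation names, and dispatch mechanism below are illustrative assumptions, not interfaces defined by this application.

```python
# Illustrative: the inference client decides which processing function of the
# calculation model is needed; the inference server then calls only that function.
def analyze_request(request: dict) -> str:
    """Minimal 'semantic analysis': map the requested operation to a model function name."""
    op = request.get("operation")
    if op == "vector_add":
        return "COMPUTE"    # only the model's COMPUTE() function is required
    if op == "classify_text":
        return "PREDICT"    # full classification path
    raise ValueError(f"unsupported operation: {op}")

def run_processing_function(model, function_name: str, data):
    """Server side: dispatch to the named processing function of the loaded model."""
    fn = getattr(model, function_name.lower(), None)  # e.g. model.compute(...)
    if fn is None:
        raise AttributeError(f"model has no processing function {function_name}")
    return fn(data)
```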
It can be seen that, in this way, the calculation model is used more precisely, the inference efficiency of the calculation model is greatly improved, and the inference burden is reduced.
In one feasible manner, the information of the processing function can be API interface information of the processing function. Through the API interface information, the processing function to be used in the calculation model can be determined quickly, and the corresponding interface information of the function can also be obtained directly for use in subsequent inference processing.
Optionally, the second computing device 204 is provided with one or more types of inference acceleration resources; when the inference acceleration resources include multiple types, different types of inference acceleration resources have different usage priorities; the inference server 2042 uses the inference acceleration resources according to a preset load balancing rule and the priorities of the multiple types of inference acceleration resources.
For example, in addition to a GPU, the second computing device 204 can also be provided with an NPU or other inference acceleration resources. The multiple inference acceleration resources have certain priorities among them, and the priorities can be set in any appropriate manner, for example according to running speed or manually, which is not limited in the embodiments of the present invention. Further optionally, a CPU can also be provided in the second computing device 204. In this case, the GPU can be given the highest usage priority, followed by the NPU, with the CPU having the lowest priority.
In this way, when a high-priority inference acceleration resource is heavily loaded, a lower-priority inference acceleration resource can be used for inference processing. On the one hand, this ensures that the inference processing can be carried out effectively; on the other hand, it can also reduce the cost of the inference acceleration resources. It should be noted that the number of inference acceleration resources of a certain type may be one or more, which is set by those skilled in the art according to requirements and is not limited in the embodiments of the present invention. In addition, besides the above load balancing by priority, those skilled in the art can also set other appropriate load balancing rules as the preset load balancing rule according to actual needs, which is not limited in the embodiments of the present invention.
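A minimal sketch of the priority-plus-load-balancing rule described above (GPU first, then NPU, then CPU) is given below, assuming each accelerator can report a utilization figure; the data class, threshold value, and fallback rule are illustrative assumptions rather than the rule mandated by this application.

```python
# Illustrative: pick an inference acceleration resource by priority, falling back
# to a lower-priority type when the preferred one is too heavily loaded.
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str           # e.g. "GPU", "NPU", "CPU"
    priority: int       # smaller value = higher usage priority
    utilization: float  # current load in [0.0, 1.0]

def select_accelerator(pool: list[Accelerator], busy_threshold: float = 0.9) -> Accelerator:
    """Preset load-balancing rule: highest priority first, skip overloaded devices."""
    for acc in sorted(pool, key=lambda a: a.priority):
        if acc.utilization < busy_threshold:
            return acc
    # Every resource is busy: fall back to the least-loaded one.
    return min(pool, key=lambda a: a.utilization)

# Example: the GPU has the highest usage priority, then the NPU, then the CPU.
pool = [Accelerator("GPU", 0, 0.95), Accelerator("NPU", 1, 0.40), Accelerator("CPU", 2, 0.10)]
chosen = select_accelerator(pool)   # -> the NPU, because the GPU is overloaded
```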
The inference system of this embodiment is described below with a specific example.
As shown in Fig. 3b, unlike the traditional architecture in which the CPU and the inference acceleration resource such as a GPU are in the same electronic device, in this example the CPU and the inference acceleration resources are decoupled into two parts, namely the CPU client machine (front-end client machine) and the server accelerator pools (back-end inference accelerator card machine) in Fig. 3b. The front-end client machine is a machine operable by the user for executing the inference business, the back-end inference accelerator card machine is used for the inference computation, and the communication between the two is realized through an ENI. The front-end client machine is provided with multiple inference frameworks, shown in the figure as "Tensorflow inference code", "pyTorch inference code", and "Mxnet inference code". The back-end inference accelerator card machine is provided with multiple kinds of inference accelerator cards, shown in the figure as "GPU", "Ali-NPU", and "Other Accelerator".
In order to forward the inference business of the front-end client machine to the back-end accelerator cards for execution and return the inference result, so that this is supported transparently on the user side, this example provides two components residing respectively on the front-end client machine and the back-end accelerator card machine: the EAI client module (i.e., the inference client) and the Service daemon (i.e., the inference server).
The EAI client module is a component in the front-end client machine, and its functions include: a) communicating with the back-end Service daemon through a network connection; b) parsing the semantics of the calculation model and the inference requests; c) sending the parsing results of the semantics of the calculation model and of the inference requests to the back-end Service daemon; and d) receiving the inference results sent by the Service daemon and returning them to the deep learning framework. In one implementation, the EAI client module can be implemented as a plugin module embedded in the functional code of a deep learning framework (such as Tensorflow/pyTorch/Mxnet). When the inference business loads the calculation model through the deep learning framework, the EAI client module intercepts the loaded calculation model, parses the semantics of the calculation model to generate the information of the calculation model, such as verification information (which may be MD5 information), and forwards the calculation model and/or the information of the calculation model, as well as the subsequent operations, to the back-end Service daemon for the actual inference computation.
The Service daemon is a resident service component of the back-end inference accelerator card machine, and its functions include: a) receiving the information of the calculation model and the parsing results of the inference requests sent by the EAI client module; b) selecting the optimal inference accelerator card in the back-end inference accelerator card machine according to the information of the calculation model and the parsing results of the inference requests; c) sending the inference computation down to the inference accelerator card; and d) receiving the inference results computed by the inference accelerator card and returning them to the EAI client module.
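Mirroring the client-side sketch given earlier, the Service daemon side can be pictured as a small loop that receives the model information and request, picks an accelerator card, loads and runs the model, and sends the result back. All names below are illustrative; `load_on_accelerator` is an assumed accelerator-specific helper, and the message helpers, accelerator selection, and function dispatch come from the earlier sketches.

```python
# Illustrative sketch of the Service daemon loop on the accelerator-card machine,
# reusing the hypothetical helpers (send_msg/recv_msg, select_accelerator,
# run_processing_function) defined in the earlier sketches.
def serve_one_connection(sock, model_pool: dict, accelerators: list):
    loaded_model = None
    while True:
        header, payload = recv_msg(sock)
        if header["type"] == "load_model":
            # a) receive model information, b) select the best inference accelerator card
            acc = select_accelerator(accelerators)
            loaded_model = load_on_accelerator(acc, model_pool[header["model_info"]])  # assumed helper
            send_msg(sock, {"status": "loaded", "accelerator": acc.name})
        elif header["type"] == "infer" and loaded_model is not None:
            # c) hand the inference computation to the accelerator, d) return the result
            result = run_processing_function(loaded_model, header.get("function", "PREDICT"), payload)
            send_msg(sock, {"status": "ok"}, result)  # result assumed to already be bytes
        else:
            send_msg(sock, {"status": "error", "reason": "model not loaded or unknown message"})
```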
其中,GPU、Ali-NPU和Other Accelerator之间具有一定的优先级,如,从高到低依次为GPU->Ali-NPU->Other Accelerator,则在实际使用时,优先使用GPU,若GPU资源不够再使用Ali-NPU,若Ali-NPU的资源仍不够,再使用Other Accelerator。Among them, GPU, Ali-NPU and Other Accelerator have a certain priority. For example, from high to low, it is GPU->Ali-NPU->Other Accelerator. In actual use, GPU is used first. If GPU resources are If Ali-NPU is not enough, if Ali-NPU resources are still insufficient, use Other Accelerator.
可见,与传统通过PCIE卡槽将CPU和GPU推理加速卡绑定在一台机器不同的是,本实例中的弹性远程推理通过弹性网卡将CPU机器(CPU client machine)和推理加速卡(server accelerator pools)解耦,对用户来说,购买CPU和GPU同台的机器不再成为必须。It can be seen that, unlike the traditional binding of CPU and GPU inference accelerator cards to one machine through the PCIE card slot, the elastic remote inference in this example combines the CPU client machine and the inference accelerator card (server accelerator) through the elastic network card. Pools) decoupling, for users, it is no longer necessary to buy a machine with the same CPU and GPU.
基于图3b所示推理***的推理过程如图3c所示,包括:步骤①,EAI client module在用户通过深度学习框架启动推理任务载入计算模型的时候截取并解析计算模型的语义,获得计算模型的信息;进而,获取用户的推理请求,并解析推理请求,获得计算模型中待使用的处理函数的信息;步骤②,EAI client module通过弹性网络与sevice daemon连接,将计算模型的信息和处理函数的信息转发给后台的Service daemon;步骤③,Service daemon根据计算模型的信息和处理函数的信息,选取最优的推理加速卡,并通过推理加速卡载入计算模型进行推理计算;步骤④,推理加速卡将推理计算的结果返回给Service daemon;步骤⑤,Service daemon将推理计算的结果通过弹性网络转发给EAI client daemon;步骤⑥,EAI client daemon将推理计算的结果返回给深度学习框架。The reasoning process based on the reasoning system shown in Figure 3b is shown in Figure 3c, including: Step ①, EAI client module intercepts and analyzes the semantics of the calculation model when the user starts the reasoning task through the deep learning framework and loads the calculation model to obtain the calculation model Further, obtain the user’s reasoning request, analyze the reasoning request, and obtain the information of the processing function to be used in the calculation model; step ②, the EAI client module is connected to the service daemon through the elastic network, and the information of the calculation model and the processing function are connected The information is forwarded to the back-end Service daemon; step ③, the service daemon selects the optimal inference accelerator card according to the information of the calculation model and the information of the processing function, and loads the calculation model through the inference accelerator card to perform inference calculation; step ④, inference The accelerator card returns the result of the inference calculation to the service daemon; step ⑤, the service daemon forwards the result of the inference calculation to the EAI client daemon through the elastic network; step ⑥, the EAI client daemon returns the result of the inference calculation to the deep learning framework.
In this way, the user performs inference services on the front-end client machine, while the EAI client module and the Service daemon automatically forward the user's inference workload in the background to the remote inference accelerator card for inference computation and return the result to the deep learning framework on the front-end client machine. This achieves elastic inference that is transparent to the user: the inference acceleration service is obtained without modifying any inference code. Moreover, the user does not need to purchase a machine equipped with a GPU; the same inference acceleration effect can be achieved with an ordinary CPU machine, without modifying any code logic.
In a specific example in which the deep learning framework is the Tensorflow framework, the interaction between the front-end client machine and the back-end inference accelerator card machine is shown in Fig. 3d.
The inference interaction includes: step 1, the front-end client machine loads the calculation model through the Tensorflow framework; step 2, the EAI client module intercepts the calculation model and verifies it; step 3, the EAI client module establishes a channel with the Service daemon and transmits the calculation model; step 4, the Service daemon analyzes the calculation model and selects the optimal inference accelerator from the accelerator card pool according to the analysis result; step 5, the selected inference accelerator loads the calculation model; step 6, the user inputs a picture or text and initiates an inference request; step 7, the Tensorflow framework obtains the user input, and the EAI client module intercepts the inference request and parses out the information of the processing function to be used and the data to be inferred entered by the user; step 8, the EAI client module transmits the processing function information and the data to be inferred to the Service daemon; step 9, the Service daemon forwards the processing function information and the data to be inferred to the inference accelerator; step 10, the inference accelerator performs the inference computation through the calculation model and sends the inference result to the Service daemon; step 11, the Service daemon transmits the inference result to the EAI client module; step 12, the EAI client module receives the inference result and hands it over to the Tensorflow framework; step 13, the Tensorflow framework presents the inference result to the user.
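The "no code change" property of this interaction can be pictured as a thin proxy that keeps the user-facing predict call intact while the heavy computation runs remotely. The sketch below is purely conceptual: it does not hook real Tensorflow internals, and the RemoteModel class and load_model shim are hypothetical names introduced only for this illustration.

```python
class RemoteModel:
    """Proxy that preserves the local `predict` interface of a loaded model.

    Intercepting the framework's own loading and run calls is what the EAI
    client module does in practice; here it is reduced to a wrapper so that
    the user's code (model = load_model(...); model.predict(x)) stays the
    same when inference moves to the remote accelerator machine.
    """

    def __init__(self, model_path: str, infer_fn):
        with open(model_path, "rb") as f:
            self._model_bytes = f.read()
        self._infer_fn = infer_fn  # e.g. a helper like remote_infer() shown earlier

    def predict(self, inputs: list) -> dict:
        # Looks like a local call, but is forwarded to the Service daemon.
        return self._infer_fn(self._model_bytes, "predict", inputs)


def load_model(model_path: str, infer_fn) -> RemoteModel:
    """Hypothetical shim standing in for the framework's own loading API."""
    return RemoteModel(model_path, infer_fn)
```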
In this way, the elastic inference process under the Tensorflow framework is realized.
According to the inference system provided in this embodiment, inference processing is deployed on separate first and second computing devices, where the second computing device is provided with inference acceleration resources and can perform the main inference processing through the calculation model, while the first computing device is responsible for the data processing before and after the inference processing. An inference client is deployed in the first computing device and an inference server is deployed in the second computing device, and during inference the two devices interact through the inference client and the inference server. The inference client first sends the model information of the calculation model to the inference server, and the inference server uses the inference acceleration resource to load the corresponding calculation model; the inference client then sends the data to be inferred to the inference server, and after receiving the data to be inferred, the inference server performs inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be equipped with an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of inference-related applications and make it impossible to satisfy a wide range of inference scenarios.
In addition, when a user uses an application involving inference, the inference computation can be seamlessly transferred, through the inference client and the inference server, to a remote device that has inference acceleration resources, and the interaction between the inference client and the inference server is imperceptible to the user. The business logic of the application and the user's habits in using the inference service therefore remain unchanged, inference is realized at low cost, and the user experience is improved.
Embodiment 3
Referring to Fig. 4, a flowchart of an inference method according to Embodiment 3 of the present invention is shown. The inference method of this embodiment describes the inference method of the present invention from the perspective of the first computing device.
The inference method of this embodiment includes the following steps:
Step S302: obtain the model information of the calculation model used for inference, and send the model information to a target computing device, so as to instruct the target computing device to use the inference acceleration resource provided in the target computing device to load the calculation model indicated by the model information.
The implementation of the target computing device in this embodiment may refer to the second computing device in the foregoing embodiments.
The execution of this step may refer to the relevant parts of the inference client in the foregoing embodiments. For example, the model information of the calculation model can be obtained when the deep learning framework loads the calculation model, and is then sent to the target computing device. After receiving the model information, the target computing device loads the corresponding calculation model through a corresponding inference acceleration resource such as a GPU.
Step S304: obtain the data to be inferred, and send the data to be inferred to the target computing device, so as to instruct the target computing device to use the inference acceleration resource to call the loaded calculation model and perform inference processing on the data to be inferred through the calculation model.
As described above, the data to be inferred is any appropriate data on which inference computation is to be performed using the calculation model. After the target computing device has loaded the calculation model, the data to be inferred can be sent to the target computing device. After receiving the data to be inferred, the target computing device performs inference processing on it using the calculation model loaded by the inference acceleration resource, such as a GPU.
Step S306: receive the result of the inference processing fed back by the target computing device.
After the target computing device completes the inference processing on the data to be inferred using the calculation model loaded by the GPU, it obtains the result of the inference processing and sends it to the execution body of this embodiment, such as the inference client, and the inference client receives the result of the inference processing.
In a specific implementation, the inference method of this embodiment may be implemented by the inference client of the first computing device in the foregoing embodiments, and the specific implementation of the above process may also refer to the operations of the inference client in the foregoing embodiments, which will not be repeated here.
Through this embodiment, inference processing is deployed on different computing devices, where the target computing device is provided with inference acceleration resources and can perform the main inference processing through the calculation model, while the current computing device that executes the inference method of this embodiment is responsible for the data processing before and after the inference processing. During inference, the current computing device first sends the model information of the calculation model to the target computing device, and the target computing device uses the inference acceleration resource to load the corresponding calculation model; the current computing device then sends the data to be inferred to the target computing device, and after receiving the data to be inferred, the target computing device performs inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be equipped with an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of inference-related applications and make it impossible to satisfy a wide range of inference scenarios.
In addition, when a user uses an application involving inference, the inference computation can be seamlessly transferred to a remote target computing device that has inference acceleration resources, and the interaction between the current computing device and the target computing device is imperceptible to the user. The business logic of the application and the user's habits in using the inference service therefore remain unchanged, inference is realized at low cost, and the user experience is improved.
Embodiment 4
The inference method of this embodiment still describes the inference method of the present invention from the perspective of the first computing device. Referring to Fig. 5, a flowchart of an inference method according to Embodiment 4 of the present invention is shown, and the inference method includes:
Step S402: obtain the model information of the calculation model used for inference, and send the model information to a target computing device.
In a feasible manner, the model information of the calculation model is identification information or verification information of the calculation model.
In a specific implementation, the first computing device may send the identification information or the verification information to the target computing device, such as the second computing device in the foregoing embodiments; the target computing device determines, according to the identification information or the verification information, whether the corresponding calculation model exists locally, and feeds the determination result back to the first computing device, so that the first computing device can determine whether the calculation model exists in the target computing device.
Step S404: if it is determined according to the model information that the calculation model does not exist in the target computing device, send the calculation model to the target computing device and instruct the target computing device to use the inference acceleration resource provided in the target computing device to load the calculation model.
When the model information of the calculation model is the identification information or verification information of the calculation model, this step may be implemented as: if it is determined through the identification information or the verification information that the calculation model does not exist in the target computing device, the calculation model is sent to the target computing device. This includes sending both the structure of the calculation model and its data to the target computing device.
If the required calculation model does not exist in the target computing device, the calculation model can be sent to the target computing device. The target computing device obtains and stores the calculation model; if the same calculation model is used again later, the target computing device can obtain it directly from local storage. This ensures that the inference processing can proceed smoothly regardless of whether the target computing device already holds the required calculation model.
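For illustration, the exchange in steps S402-S404 could look like the sketch below, where the verification information is taken to be a SHA-256 digest of the serialized model; the query/upload helper names are assumptions for this example, not part of the disclosure.

```python
import hashlib

def model_checksum(model_path: str) -> str:
    """Verification information for the model: here a SHA-256 digest of the file."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def ensure_model_on_target(model_path: str, send, query) -> str:
    """Send the model only if the target reports that it does not hold it yet.

    `query(checksum)` asks the target whether the model exists there and
    returns True/False; `send(checksum, payload)` uploads the model structure
    and its data. Both are stand-ins for the actual transport layer.
    """
    checksum = model_checksum(model_path)
    if not query(checksum):                      # step S402: model info only
        with open(model_path, "rb") as f:
            send(checksum, f.read())             # step S404: full model
    return checksum
```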
Step S406: obtain the data to be inferred, and send the data to be inferred to the target computing device, so as to instruct the target computing device to use the inference acceleration resource to call the loaded calculation model and perform inference processing on the data to be inferred through the calculation model.
In a feasible manner, obtaining the data to be inferred and sending the data to be inferred to the target computing device may include: obtaining an inference request that requests the calculation model to perform inference processing on the data to be inferred, and performing semantic analysis on the inference request; and determining, according to the result of the semantic analysis, the processing function in the calculation model to be called, and sending the information of the processing function and the data to be inferred to the target computing device, so as to instruct the target computing device to perform inference processing on the data to be inferred by calling the processing function indicated by the processing function information in the loaded calculation model.
Optionally, the information of the processing function may be the API interface information of the processing function.
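A toy illustration of the semantic analysis described above: an incoming request is inspected to decide which processing function of the model should be called, and only that function's API information plus the data to be inferred are forwarded. The request format and the name-to-API mapping are invented for this sketch and are not taken from the disclosure.

```python
from typing import Tuple

# Illustrative mapping from request intent to the model's API interface info.
FUNCTION_API = {
    "classify": {"api": "serving_default", "signature": "classify(image) -> label"},
    "detect":   {"api": "detect_objects",  "signature": "detect(image) -> boxes"},
}

def parse_inference_request(request: dict) -> Tuple[dict, list]:
    """Return (processing-function info, data to be inferred) for forwarding.

    `request` is assumed to look like {"task": "classify", "data": [...]};
    the semantic analysis is reduced here to a dictionary lookup.
    """
    task = request.get("task", "classify")
    if task not in FUNCTION_API:
        raise ValueError(f"unsupported inference task: {task}")
    return FUNCTION_API[task], request["data"]

if __name__ == "__main__":
    fn_info, payload = parse_inference_request({"task": "classify", "data": ["img-0.png"]})
    print(fn_info["api"], payload)  # -> serving_default ['img-0.png']
```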
Step S408: receive the result of the inference processing fed back by the target computing device.
In a specific implementation, the inference method of this embodiment may be implemented by the inference client of the first computing device in the foregoing embodiments, and the specific implementation of the above process may also refer to the operations of the inference client in the foregoing embodiments, which will not be repeated here.
Through this embodiment, inference processing is deployed on different computing devices, where the target computing device is provided with inference acceleration resources and can perform the main inference processing through the calculation model, while the current computing device that executes the inference method of this embodiment is responsible for the data processing before and after the inference processing. During inference, the current computing device first sends the model information of the calculation model to the target computing device, and the target computing device uses the inference acceleration resource to load the corresponding calculation model; the current computing device then sends the data to be inferred to the target computing device, and after receiving the data to be inferred, the target computing device performs inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be equipped with an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of inference-related applications and make it impossible to satisfy a wide range of inference scenarios.
In addition, when a user uses an application involving inference, the inference computation can be seamlessly transferred to a remote target computing device that has inference acceleration resources, and the interaction between the current computing device and the target computing device is imperceptible to the user. The business logic of the application and the user's habits in using the inference service therefore remain unchanged, inference is realized at low cost, and the user experience is improved.
Embodiment 5
Referring to Fig. 6, a flowchart of an inference method according to Embodiment 5 of the present invention is shown. The inference method of this embodiment describes the inference method of the present invention from the perspective of the second computing device.
The inference method of this embodiment includes the following steps:
Step S502: obtain the model information, sent by a source computing device, of the calculation model used for inference, and load the calculation model indicated by the model information through an inference acceleration resource.
In this embodiment, the source computing device may be the first computing device in the foregoing embodiments, and the model information includes, but is not limited to, identification information and/or verification information.
Step S504: obtain the data to be inferred sent by the source computing device, use the inference acceleration resource to call the loaded calculation model, and perform inference processing on the data to be inferred through the calculation model.
After the calculation model has been loaded through the inference acceleration resource, once the data to be inferred sent from the source computing device is received, inference processing can be performed on it using the calculation model loaded by the inference acceleration resource.
Step S506: feed back the result of the inference processing to the source computing device.
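One possible shape of the server-side loop for steps S502-S506 is sketched below. The message format follows the client sketch given earlier, and load_with_accelerator / run_inference are placeholders for vendor-specific runtime calls, which this document does not specify.

```python
import json
import socketserver

LOADED_MODELS = {}  # checksum -> handle of a model loaded on an accelerator

def load_with_accelerator(model_info: dict):
    """Placeholder: load the model indicated by model_info onto an accelerator."""
    return {"checksum": model_info["checksum"], "card": "GPU-0"}

def run_inference(model_handle, function: str, inputs: list) -> list:
    """Placeholder: call the loaded model's processing function on the inputs."""
    return [{"function": function, "input": x, "label": "unknown"} for x in inputs]

class InferenceHandler(socketserver.StreamRequestHandler):
    def handle(self):
        size = int.from_bytes(self.rfile.read(4), "big")
        request = json.loads(self.rfile.read(size).decode("utf-8"))
        info = request["model_info"]
        if info["checksum"] not in LOADED_MODELS:                     # step S502
            LOADED_MODELS[info["checksum"]] = load_with_accelerator(info)
        model_handle = LOADED_MODELS[info["checksum"]]
        result = run_inference(model_handle, request["function"],    # step S504
                               request["inputs"])
        payload = json.dumps({"result": result}).encode("utf-8")
        self.wfile.write(len(payload).to_bytes(4, "big") + payload)  # step S506

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 9000), InferenceHandler) as server:
        server.serve_forever()
```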
In a specific implementation, the inference method of this embodiment may be implemented by the inference server of the second computing device in the foregoing embodiments, and the specific implementation of the above process may also refer to the operations of the inference server in the foregoing embodiments, which will not be repeated here.
Through this embodiment, inference processing is deployed on different computing devices, where the current computing device that executes the inference method of this embodiment is provided with inference acceleration resources and can perform the main inference processing through the calculation model, while the source computing device is responsible for the data processing before and after the inference processing. During inference, the source computing device first sends the model information of the calculation model to the current computing device, and the current computing device uses the inference acceleration resource to load the corresponding calculation model; the source computing device then sends the data to be inferred to the current computing device, and after receiving the data to be inferred, the current computing device performs inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be equipped with an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of inference-related applications and make it impossible to satisfy a wide range of inference scenarios.
In addition, when a user uses an application involving inference, the inference computation can be seamlessly transferred to a remote device that has inference acceleration resources, and the interaction between the source computing device and the current computing device is imperceptible to the user. The business logic of the application and the user's habits in using the inference service therefore remain unchanged, inference is realized at low cost, and the user experience is improved.
Embodiment 6
The inference method of this embodiment still describes the inference method of the present invention from the perspective of the second computing device. Referring to Fig. 7, a flowchart of an inference method according to Embodiment 6 of the present invention is shown, and the inference method includes:
Step S602: if it is determined, according to the model information of the calculation model, that the calculation model does not exist locally, request the calculation model from the source computing device, and after obtaining the calculation model from the source computing device, load the calculation model through the inference acceleration resource.
In this embodiment, the model information of the calculation model may be the identification information or verification information of the calculation model. This step may then be implemented as: if it is determined, according to the identification information or the verification information, that the calculation model does not exist locally, request the calculation model from the source computing device, and after obtaining the calculation model from the source computing device, load the calculation model through the inference acceleration resource. The calculation model sent by the source computing device includes, but is not limited to, the structure of the calculation model and its corresponding data.
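Step S602 can be pictured, on the second computing device's side, as the fragment below. MODEL_DIR and request_model_from_source are hypothetical stand-ins for the local model pool and the return channel to the source device over the elastic network; they are not defined by the disclosure.

```python
import os

MODEL_DIR = "/var/lib/eai/models"  # illustrative local model pool

def ensure_local_model(checksum: str, request_model_from_source) -> str:
    """Return the local path of the model, fetching it from the source if absent.

    `request_model_from_source(checksum)` is assumed to return the model's
    structure and data as bytes.
    """
    path = os.path.join(MODEL_DIR, checksum)
    if not os.path.exists(path):                  # model not present locally
        os.makedirs(MODEL_DIR, exist_ok=True)
        with open(path, "wb") as f:
            f.write(request_model_from_source(checksum))
    return path
```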
In addition, in an optional manner, the inference acceleration resource includes one or more types; when the inference acceleration resource includes multiple types, different types of inference acceleration resources have different usage priorities. In this case, loading the calculation model indicated by the model information through the inference acceleration resource includes: loading the calculation model indicated by the model information using an inference acceleration resource according to a preset load balancing rule and the priorities of the multiple types of inference acceleration resources.
The load balancing rule and the priorities can both be appropriately set by those skilled in the art according to actual requirements.
Step S604: obtain the data to be inferred sent by the source computing device, use the inference acceleration resource to call the loaded calculation model, and perform inference processing on the data to be inferred through the calculation model.
In a feasible manner, this step may be implemented as: obtaining the data to be inferred and the information of the processing function in the calculation model to be called, both sent by the source computing device, and performing inference processing on the data to be inferred by calling the processing function indicated by the processing function information in the loaded calculation model. The information of the processing function may be obtained by the source computing device by parsing the inference request.
Optionally, the information of the processing function is the API interface information of the processing function.
Step S606: feed back the result of the inference processing to the source computing device.
In a specific implementation, the inference method of this embodiment may be implemented by the inference server of the second computing device in the foregoing embodiments, and the specific implementation of the above process may also refer to the operations of the inference server in the foregoing embodiments, which will not be repeated here.
Through this embodiment, inference processing is deployed on different computing devices, where the current computing device that executes the inference method of this embodiment is provided with inference acceleration resources and can perform the main inference processing through the calculation model, while the source computing device is responsible for the data processing before and after the inference processing. During inference, the source computing device first sends the model information of the calculation model to the current computing device, and the current computing device uses the inference acceleration resource to load the corresponding calculation model; the source computing device then sends the data to be inferred to the current computing device, and after receiving the data to be inferred, the current computing device performs inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be equipped with an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of inference-related applications and make it impossible to satisfy a wide range of inference scenarios.
In addition, when a user uses an application involving inference, the inference computation can be seamlessly transferred to a remote device that has inference acceleration resources, and the interaction between the source computing device and the current computing device is imperceptible to the user. The business logic of the application and the user's habits in using the inference service therefore remain unchanged, inference is realized at low cost, and the user experience is improved.
Embodiment 7
Referring to Fig. 8, a schematic structural diagram of an electronic device according to Embodiment 7 of the present invention is shown. The specific embodiments of the present invention do not limit the specific implementation of the electronic device.
As shown in Fig. 8, the electronic device may include: a processor 702, a communications interface 704, a memory 706, and a communication bus 708.
Wherein:
The processor 702, the communications interface 704, and the memory 706 communicate with one another through the communication bus 708.
The communications interface 704 is configured to communicate with other electronic devices or servers.
The processor 702 is configured to execute a program 710, and may specifically execute the relevant steps in the inference method embodiments of Embodiment 3 or 4 above.
Specifically, the program 710 may include program code, and the program code includes computer operation instructions.
The processor 702 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 706 is configured to store the program 710. The memory 706 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory.
The program 710 may specifically be used to cause the processor 702 to perform the following operations: obtain the model information of the calculation model used for inference, and send the model information to a target computing device, so as to instruct the target computing device to use the inference acceleration resource provided in the target computing device to load the calculation model indicated by the model information; obtain the data to be inferred, and send the data to be inferred to the target computing device, so as to instruct the target computing device to use the inference acceleration resource to call the loaded calculation model and perform inference processing on the data to be inferred through the calculation model; and receive the result of the inference processing fed back by the target computing device.
In an optional implementation, the program 710 is further used to cause the processor 702 to send the calculation model to the target computing device if it is determined that the calculation model does not exist in the target computing device.
In an optional implementation, the model information of the calculation model is the identification information or verification information of the calculation model; the program 710 is further used to cause the processor 702 to determine, through the identification information or the verification information, whether the calculation model exists in the target computing device before the calculation model is sent to the target computing device upon determining that it does not exist there.
In an optional implementation, the program 710 is further used to cause the processor 702, when obtaining the data to be inferred and sending the data to be inferred to the target computing device, to: obtain an inference request that requests the calculation model to perform inference processing on the data to be inferred, and perform semantic analysis on the inference request; and determine, according to the result of the semantic analysis, the processing function in the calculation model to be called, and send the information of the processing function and the data to be inferred to the target computing device, so as to instruct the target computing device to perform inference processing on the data to be inferred by calling the processing function indicated by the processing function information in the loaded calculation model.
In an optional implementation, the information of the processing function is the API interface information of the processing function.
For the specific implementation of each step in the program 710, reference may be made to the corresponding descriptions of the corresponding steps and units in the foregoing inference method embodiments, which will not be repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which will not be repeated here.
Through the electronic device of this embodiment, inference processing is deployed on different computing devices, where the target computing device is provided with inference acceleration resources and can perform the main inference processing through the calculation model, while the current electronic device that executes the inference method of this embodiment is responsible for the data processing before and after the inference processing. During inference, the current electronic device first sends the model information of the calculation model to the target computing device, and the target computing device uses the inference acceleration resource to load the corresponding calculation model; the current electronic device then sends the data to be inferred to the target computing device, and after receiving the data to be inferred, the target computing device performs inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be equipped with an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of inference-related applications and make it impossible to satisfy a wide range of inference scenarios.
In addition, when a user uses an application involving inference, the inference computation can be seamlessly transferred to a remote target computing device that has inference acceleration resources, and the interaction between the current electronic device and the target computing device is imperceptible to the user. The business logic of the application and the user's habits in using the inference service therefore remain unchanged, inference is realized at low cost, and the user experience is improved.
Embodiment 8
Referring to Fig. 9, a schematic structural diagram of an electronic device according to Embodiment 8 of the present invention is shown. The specific embodiments of the present invention do not limit the specific implementation of the electronic device.
As shown in Fig. 9, the electronic device may include: a processor 802, a communications interface 804, a memory 806, and a communication bus 808.
Wherein:
The processor 802, the communications interface 804, and the memory 806 communicate with one another through the communication bus 808.
The communications interface 804 is configured to communicate with other electronic devices or servers.
The processor 802 is configured to execute a program 810, and may specifically execute the relevant steps in the inference method embodiments of Embodiment 5 or 6 above.
Specifically, the program 810 may include program code, and the program code includes computer operation instructions.
The processor 802 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 806 is configured to store the program 810. The memory 806 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory.
The program 810 may specifically be used to cause the processor 802 to perform the following operations: obtain the model information, sent by a source computing device, of the calculation model used for inference, and load the calculation model indicated by the model information through an inference acceleration resource; obtain the data to be inferred sent by the source computing device, use the inference acceleration resource to call the loaded calculation model, and perform inference processing on the data to be inferred through the calculation model; and feed back the result of the inference processing to the source computing device.
In an optional implementation, the model information of the calculation model is the identification information or verification information of the calculation model; the program 810 is further used to cause the processor 802, when obtaining the model information of the calculation model used for inference sent by the source computing device and loading the calculation model indicated by the model information through the inference acceleration resource, to: upon determining according to the identification information or the verification information that the calculation model does not exist locally, request the calculation model from the source computing device, and after obtaining the calculation model from the source computing device, load the calculation model through the inference acceleration resource.
In an optional implementation, the program 810 is further used to cause the processor 802, when obtaining the data to be inferred sent by the source computing device, using the inference acceleration resource to call the loaded calculation model, and performing inference processing on the data to be inferred through the calculation model, to: obtain the data to be inferred and the information of the processing function in the calculation model to be called, both sent by the source computing device, and perform inference processing on the data to be inferred by calling the processing function indicated by the processing function information in the loaded calculation model.
In an optional implementation, the information of the processing function is the API interface information of the processing function.
In an optional implementation, the inference acceleration resource includes one or more types; when the inference acceleration resource includes multiple types, different types of inference acceleration resources have different usage priorities. The program 810 is further used to cause the processor 802, when loading the calculation model indicated by the model information through the inference acceleration resource, to: load the calculation model indicated by the model information using an inference acceleration resource according to a preset load balancing rule and the priorities of the multiple types of inference acceleration resources.
For the specific implementation of each step in the program 810, reference may be made to the corresponding descriptions of the corresponding steps and units in the foregoing inference method embodiments, which will not be repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which will not be repeated here.
Through the electronic device of this embodiment, inference processing is deployed on different computing devices, where the current electronic device that executes the inference method of this embodiment is provided with inference acceleration resources and can perform the main inference processing through the calculation model, while the source computing device is responsible for the data processing before and after the inference processing. During inference, the source computing device first sends the model information of the calculation model to the current electronic device, and the current electronic device uses the inference acceleration resource to load the corresponding calculation model; the source computing device then sends the data to be inferred to the current electronic device, and after receiving the data to be inferred, the current electronic device performs inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing other than the inference processing can be carried out on different computing devices, only one of which needs to be equipped with an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of inference-related applications and make it impossible to satisfy a wide range of inference scenarios.
In addition, when a user uses an application involving inference, the inference computation can be seamlessly transferred to a remote device that has inference acceleration resources, and the interaction between the source computing device and the current electronic device is imperceptible to the user. The business logic of the application and the user's habits in using the inference service therefore remain unchanged, inference is realized at low cost, and the user experience is improved.
It should be noted that, according to implementation needs, each component/step described in the embodiments of the present invention may be split into more components/steps, and two or more components/steps or partial operations of components/steps may also be combined into new components/steps to achieve the purpose of the embodiments of the present invention.
The above methods according to the embodiments of the present invention may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is downloaded over a network, originally stored in a remote recording medium or non-transitory machine-readable medium, and to be stored in a local recording medium, so that the methods described herein can be processed as such software on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the inference methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the inference methods shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing the inference methods shown herein.
Those of ordinary skill in the art may realize that the units and method steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered as going beyond the scope of the embodiments of the present invention.
The above implementations are only used to illustrate the embodiments of the present invention and do not limit them. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention; therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention shall be defined by the claims.

Claims (20)

  1. An inference system, characterized by comprising a first computing device and a second computing device connected to each other, wherein an inference client is provided in the first computing device, and an inference acceleration resource and an inference server are provided in the second computing device;
    wherein:
    the inference client is configured to obtain model information of a calculation model used for inference and data to be inferred, and to respectively send the model information and the data to be inferred to the inference server in the second computing device;
    the inference server is configured to load and call, through the inference acceleration resource, the calculation model indicated by the model information, to perform inference processing on the data to be inferred through the calculation model, and to feed back a result of the inference processing to the inference client.
  2. The inference system according to claim 1, characterized in that the inference client is further configured to send the calculation model to the inference server when it is determined that the calculation model does not exist in the second computing device.
  3. The inference system according to claim 2, characterized in that the model information of the calculation model is identification information or verification information of the calculation model;
    the inference server is further configured to determine, through the identification information or the verification information, whether the calculation model exists in the second computing device, and to return a determination result to the inference client.
  4. The inference system according to claim 1, characterized in that:
    the inference client is further configured to obtain an inference request that requests the calculation model to perform inference processing on the data to be inferred, to perform semantic analysis on the inference request, to determine, according to a result of the semantic analysis, a processing function in the calculation model to be called, and to send information of the processing function to the inference server;
    when performing the inference processing on the data to be inferred through the calculation model, the inference server performs the inference processing on the data to be inferred by calling the processing function indicated by the information of the processing function in the loaded calculation model.
  5. The inference system according to claim 4, characterized in that the information of the processing function is API interface information of the processing function.
  6. The inference system according to claim 1, characterized in that one or more types of inference acceleration resources are provided in the second computing device;
    when the inference acceleration resources include multiple types, different types of inference acceleration resources have different usage priorities;
    the inference server uses the inference acceleration resources according to preset load balancing rules and the priorities of the multiple types of inference acceleration resources.
  7. The inference system according to any one of claims 1-6, characterized in that the first computing device and the second computing device are connected to each other through an elastic network.
  8. The inference system according to any one of claims 1-6, characterized in that the inference client is a component embedded inside a deep learning framework in the first computing device, or the inference client is a callable file that can be invoked by the deep learning framework.
  9. An inference method, characterized in that the method comprises:
    acquiring model information of a calculation model used for inference, and sending the model information to a target computing device, so as to instruct the target computing device to load, using an inference acceleration resource provided in the target computing device, the calculation model indicated by the model information;
    acquiring data to be inferred, and sending the data to be inferred to the target computing device, so as to instruct the target computing device to invoke the loaded calculation model using the inference acceleration resource and perform inference processing on the data to be inferred through the calculation model;
    receiving the result of the inference processing fed back by the target computing device.
  10. The method according to claim 9, characterized in that the method further comprises:
    if it is determined that the calculation model does not exist in the target computing device, sending the calculation model to the target computing device.
  11. The method according to claim 10, characterized in that the model information of the calculation model is identification information or verification information of the calculation model;
    before the sending the calculation model to the target computing device if it is determined that the calculation model does not exist in the target computing device, the method further comprises: determining, through the identification information or the verification information, whether the calculation model exists in the target computing device.
  12. The method according to claim 9, characterized in that the acquiring data to be inferred and sending the data to be inferred to the target computing device comprises:
    acquiring an inference request that requests the calculation model to perform inference processing on the data to be inferred, and performing semantic analysis on the inference request;
    determining, according to the semantic analysis result, a processing function in the calculation model to be invoked, and sending information of the processing function and the data to be inferred to the target computing device, so as to instruct the target computing device to perform the inference processing on the data to be inferred by invoking the processing function indicated by the information of the processing function in the loaded calculation model.
  13. The method according to claim 12, characterized in that the information of the processing function is API interface information of the processing function.
  14. An inference method, characterized in that the method comprises:
    acquiring model information of a calculation model used for inference sent by a source computing device, and loading, through an inference acceleration resource, the calculation model indicated by the model information;
    acquiring data to be inferred sent by the source computing device, invoking the loaded calculation model using the inference acceleration resource, and performing inference processing on the data to be inferred through the calculation model;
    feeding back the result of the inference processing to the source computing device.
  15. The method according to claim 14, characterized in that the model information of the calculation model is identification information or verification information of the calculation model;
    the acquiring model information of the calculation model used for inference sent by the source computing device, and loading, through the inference acceleration resource, the calculation model indicated by the model information comprises:
    if it is determined, according to the identification information or the verification information, that the calculation model does not exist locally, requesting the calculation model from the source computing device, and, after acquiring the calculation model from the source computing device, loading the calculation model through the inference acceleration resource.
  16. The method according to claim 14, characterized in that the acquiring data to be inferred sent by the source computing device, invoking the loaded calculation model using the inference acceleration resource, and performing inference processing on the data to be inferred through the calculation model comprises:
    acquiring the data to be inferred sent by the source computing device and information of a processing function in the calculation model to be invoked, and performing the inference processing on the data to be inferred by invoking the processing function indicated by the information of the processing function in the loaded calculation model.
  17. The method according to claim 16, characterized in that the information of the processing function is API interface information of the processing function.
  18. The method according to claim 14, characterized in that the inference acceleration resource includes one or more types;
    when the inference acceleration resource includes multiple types, different types of inference acceleration resources have different usage priorities;
    the loading, through the inference acceleration resource, the calculation model indicated by the model information comprises: loading the calculation model indicated by the model information using the inference acceleration resources according to preset load balancing rules and the priorities of the multiple types of inference acceleration resources.
  19. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;
    the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the inference method according to any one of claims 9-13, or causes the processor to perform operations corresponding to the inference method according to any one of claims 14-18.
  20. A computer storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, the inference method according to any one of claims 9-13 is implemented, or the inference method according to any one of claims 14-18 is implemented.
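
For illustration only, the following is a minimal sketch in Python of the model handshake and inference exchange described in claims 1-3, 9-11 and 14-15. The class and method names (InferenceServer, InferenceClient, upload_model, and so on), the use of a SHA-256 digest as the verification information, and the in-process method calls standing in for network traffic are all assumptions made for this example; the claims do not prescribe any particular transport, hash, or API.

import hashlib

class InferenceServer:
    """Runs on the second/target computing device, next to the acceleration resources."""

    def __init__(self):
        self.model_pool = {}      # checksum -> serialized model kept in the local resource pool
        self.loaded_model = None  # model currently loaded onto the acceleration resource

    def has_model(self, checksum: str) -> bool:
        # Claim 3: decide from the identification/verification information alone
        return checksum in self.model_pool

    def upload_model(self, checksum: str, model_bytes: bytes) -> None:
        self.model_pool[checksum] = model_bytes

    def load_model(self, checksum: str) -> None:
        # Stand-in for loading the model onto an accelerator such as a GPU
        self.loaded_model = self.model_pool[checksum]

    def infer(self, data):
        # Stand-in for running the loaded model on the accelerator
        return {"echo": data, "model_size": len(self.loaded_model)}

class InferenceClient:
    """Runs on the first/source computing device, inside or next to the deep learning framework."""

    def __init__(self, server: InferenceServer):
        self.server = server

    def run(self, model_bytes: bytes, data):
        checksum = hashlib.sha256(model_bytes).hexdigest()  # model information (claim 3)
        if not self.server.has_model(checksum):             # claims 2 and 10
            self.server.upload_model(checksum, model_bytes)
        self.server.load_model(checksum)                     # claims 9 and 14
        return self.server.infer(data)                       # claims 1, 9 and 14

# Usage: the client only ships the model when the server does not already hold it.
server = InferenceServer()
client = InferenceClient(server)
print(client.run(b"fake-serialized-model", [1.0, 2.0, 3.0]))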
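
Claims 4-5, 12-13 and 16-17 describe determining, by semantic analysis of the inference request, which processing function of the calculation model to invoke, and passing that function's API information to the serving side. The rough sketch below uses assumed names (the keyword table, analyze_request and dispatch are illustrative only); real semantic analysis inside a deep learning framework would be considerably richer than keyword matching.

class DummyModel:
    """Stand-in for a calculation model already loaded on the acceleration resource."""
    def predict(self, data):        return {"label": "cat", "input": data}
    def detect_objects(self, data): return {"boxes": [], "input": data}
    def encode(self, data):         return {"embedding": [0.0] * 4, "input": data}

# Hypothetical mapping from request wording to processing-function (API) information.
API_TABLE = {
    "classify": "predict",
    "detect":   "detect_objects",
    "embed":    "encode",
}

def analyze_request(request: str) -> str:
    """Client side (claims 4 and 12): pick the processing function implied by the request."""
    for keyword, api_name in API_TABLE.items():
        if keyword in request.lower():
            return api_name
    raise ValueError("no processing function matches the request")

def dispatch(model, api_name: str, data):
    """Server side (claim 16): resolve the API information to a callable on the loaded model."""
    return getattr(model, api_name)(data)

api_name = analyze_request("Please classify this image")
print(dispatch(DummyModel(), api_name, data=[0.1, 0.2]))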
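
Claims 6 and 18 combine per-type usage priorities with preset load balancing rules when choosing among multiple types of inference acceleration resources. One possible reading is sketched below with assumed semantics (a lower number means a preferred type, and a least-busy rule balances load within that type); the actual balancing rules are left open by the claims.

from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    priority: int        # assumed convention: lower value = preferred resource type
    active_jobs: int = 0

def pick_accelerator(pool):
    """Prefer the highest-priority type, then balance load inside that type
    with a simple least-connections rule."""
    best = min(acc.priority for acc in pool)
    candidates = [acc for acc in pool if acc.priority == best]
    return min(candidates, key=lambda acc: acc.active_jobs)

pool = [
    Accelerator("gpu-0",  priority=0, active_jobs=3),
    Accelerator("gpu-1",  priority=0, active_jobs=1),
    Accelerator("fpga-0", priority=1, active_jobs=0),
]
chosen = pick_accelerator(pool)
chosen.active_jobs += 1
print(chosen.name)  # gpu-1: same priority class as gpu-0, but less loaded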
PCT/CN2020/127026 2019-11-08 2020-11-06 Inference system, inference method, electronic device and computer storage medium WO2021088964A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911089253.XA CN112784989B (en) 2019-11-08 2019-11-08 Inference system, inference method, electronic device, and computer storage medium
CN201911089253.X 2019-11-08

Publications (1)

Publication Number Publication Date
WO2021088964A1 true WO2021088964A1 (en) 2021-05-14

Family

ID=75748575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127026 WO2021088964A1 (en) 2019-11-08 2020-11-06 Inference system, inference method, electronic device and computer storage medium

Country Status (3)

Country Link
CN (1) CN112784989B (en)
TW (1) TW202119255A (en)
WO (1) WO2021088964A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344208B (en) * 2021-06-25 2023-04-07 中国电信股份有限公司 Data reasoning method, device and system
CN116127082A (en) * 2021-11-12 2023-05-16 华为技术有限公司 Data acquisition method, system and related device
TWI832279B (en) * 2022-06-07 2024-02-11 宏碁股份有限公司 Artificial intelligence model calculation acceleration system and artificial intelligence model calculation acceleration method
WO2024000605A1 (en) * 2022-07-01 2024-01-04 北京小米移动软件有限公司 Ai model reasoning method and apparatus
CN114997401B (en) * 2022-08-03 2022-11-04 腾讯科技(深圳)有限公司 Adaptive inference acceleration method, apparatus, computer device, and storage medium
CN116402141B (en) * 2023-06-09 2023-09-05 太初(无锡)电子科技有限公司 Model reasoning method and device, electronic equipment and storage medium
CN116723191B (en) * 2023-08-07 2023-11-10 深圳鲲云信息科技有限公司 Method and system for performing data stream acceleration calculations using acceleration devices

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106383835A (en) * 2016-08-29 2017-02-08 华东师范大学 Natural language knowledge exploration system based on formal semantics reasoning and deep learning
CN109145168A (en) * 2018-07-11 2019-01-04 广州极天信息技术股份有限公司 A kind of expert service robot cloud platform
CN110199274A (en) * 2016-12-02 2019-09-03 微软技术许可有限责任公司 System and method for automating query answer generation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101126524B1 (en) * 2010-06-25 2012-03-22 국민대학교산학협력단 User-centered context awareness system, context translation method therefor and Case-based Inference method therefor
CN104020983A (en) * 2014-06-16 2014-09-03 上海大学 KNN-GPU acceleration method based on OpenCL
CN105808568B (en) * 2014-12-30 2020-02-14 华为技术有限公司 Context distributed reasoning method and device
CN108171117B (en) * 2017-12-05 2019-05-21 南京南瑞信息通信科技有限公司 Electric power artificial intelligence visual analysis system based on multicore heterogeneous Computing
CN109902818B (en) * 2019-01-15 2021-05-25 中国科学院信息工程研究所 Distributed acceleration method and system for deep learning training task

Also Published As

Publication number Publication date
TW202119255A (en) 2021-05-16
CN112784989B (en) 2024-05-03
CN112784989A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2021088964A1 (en) Inference system, inference method, electronic device and computer storage medium
WO2021139177A1 (en) Image augmentation method and apparatus, computer device, and storage medium
EP3343364A1 (en) Accelerator virtualization method and apparatus, and centralized resource manager
US11790004B2 (en) Systems, methods, and apparatuses for providing assistant deep links to effectuate third-party dialog session transfers
CN110569127B (en) Virtual resource transferring, sending and obtaining method and device
US11182210B2 (en) Method for resource allocation and terminal device
CN111338808B (en) Collaborative computing method and system
WO2023029961A1 (en) Task execution method and system, electronic device, and computer storage medium
CN111200606A (en) Deep learning model task processing method, system, server and storage medium
US20240152393A1 (en) Task execution method and apparatus
CN110738156A (en) face recognition system and method based on message middleware
CN111813529B (en) Data processing method, device, electronic equipment and storage medium
CN115550354A (en) Data processing method and device and computer readable storage medium
WO2017185632A1 (en) Data transmission method and electronic device
US9124702B2 (en) Strategy pairing
CN113126958B (en) Decision scheduling customization method and system based on information flow
CN114222028A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN113033475A (en) Target object tracking method, related device and computer program product
CN113746754B (en) Data transmission method, device, equipment and storage medium
CN115460053B (en) Service calling method, device and edge computing system
WO2024087844A1 (en) Graph neural network training method and system, and abnormal account identification method
WO2023206049A1 (en) Ai service execution methods and apparatuses, and network elements, storage medium and chip
WO2022120993A1 (en) Resource allocation method and apparatus for online scenario, and electronic device
CN115699167A (en) Compensating for hardware differences when determining whether to offload assistant-related processing tasks from certain client devices
Chatzopoulos et al. Fides: A hidden market approach for trusted mobile ambient computing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20884822

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20884822

Country of ref document: EP

Kind code of ref document: A1