CN112633502B - Cross-platform execution method and device of deep learning model and electronic equipment - Google Patents

Cross-platform execution method and device of deep learning model and electronic equipment

Info

Publication number
CN112633502B
CN112633502B (application CN202011598792.9A)
Authority
CN
China
Prior art keywords
nodes
hardware
deep learning
learning model
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011598792.9A
Other languages
Chinese (zh)
Other versions
CN112633502A (en)
Inventor
严春伟
石晓伟
胡志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011598792.9A priority Critical patent/CN112633502B/en
Publication of CN112633502A publication Critical patent/CN112633502A/en
Application granted granted Critical
Publication of CN112633502B publication Critical patent/CN112633502B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06N 3/105 Shells for specifying net layout
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Stored Programmes (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a cross-platform execution method and device of a deep learning model and an electronic device, relating to the field of artificial intelligence, and in particular to the technical fields of deep learning and the Internet of Things. The specific implementation scheme is as follows: obtain a deep learning model to be executed; generate an assignment graph according to the deep learning model to be executed; obtain the hardware operator corresponding to each node and optimize the assignment graph according to the hardware operators corresponding to the nodes; and execute the hardware operator corresponding to each node according to the optimized assignment graph. By optimizing the assignment graph, state transitions are handled effectively, cross-platform execution of the deep learning model can be scheduled efficiently, the method adapts effectively to a variety of platforms, and the efficiency and reliability of the cross-platform execution process are improved.

Description

Cross-platform execution method and device of deep learning model and electronic equipment
Technical Field
The present disclosure relates to the field of computer technology, and more particularly, to the field of deep learning and internet of things technology.
Background
In recent years, with the rise of Deep Learning (DL) technology, its application scenarios have gradually widened, and the technology is widely applied on various hardware platforms, such as smart phones, smart speakers, watches, and servers. As a result, deep learning inference engines face increasingly demanding challenges.
However, the cross-platform execution methods of deep learning models in the related art suffer from technical problems such as targeting only some platforms (e.g., servers) and lacking lightweight deployment capability. Therefore, how to schedule cross-platform execution of deep learning models efficiently while adapting effectively to a variety of platforms has become an important research direction.
Disclosure of Invention
The disclosure provides a cross-platform execution method and device of a deep learning model and electronic equipment.
According to an aspect of the present disclosure, a cross-platform execution method of a deep learning model is provided, including:
obtaining a deep learning model to be executed, wherein the deep learning model comprises a plurality of code components;
generating an assignment graph according to the deep learning model to be executed, wherein the assignment graph comprises a plurality of nodes, and each node corresponds to one code component;
acquiring a hardware operator corresponding to each node, and optimizing the assignment graph according to the hardware operators corresponding to the nodes;
and executing the hardware operator corresponding to each node according to the optimized assignment graph.
According to another aspect of the present disclosure, there is provided a cross-platform execution device of a deep learning model, including:
an acquisition module, configured to obtain a deep learning model to be executed, wherein the deep learning model comprises a plurality of code components;
a generating module, configured to generate an assignment graph according to the deep learning model to be executed, wherein the assignment graph comprises a plurality of nodes, and each node corresponds to one code component;
an optimization module, configured to obtain a hardware operator corresponding to each node and optimize the assignment graph according to the hardware operators corresponding to the nodes;
and an execution module, configured to execute the hardware operator corresponding to each node according to the optimized assignment graph.
According to another aspect of the present disclosure, there is provided a deep learning inference engine comprising: the cross-platform execution device of the deep learning model according to the second aspect of the disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of cross-platform execution of a deep learning model according to the first aspect of the disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the cross-platform execution method of the deep learning model of the first aspect of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the cross-platform execution method of the deep learning model according to the first aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a process for optimizing an assignment graph;
FIG. 5 is a schematic diagram of another process for optimizing an assignment graph;
FIG. 6 is a schematic diagram of another process for optimizing an assignment graph;
FIG. 7 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 9 is a schematic diagram of an overall process of cross-platform execution of a deep learning model;
FIG. 10 is a block diagram of a cross-platform execution device of a deep learning model used to implement a cross-platform execution method of a deep learning model of an embodiment of the present disclosure;
fig. 11 is a block diagram of an electronic device used to implement a cross-platform execution method of a deep learning model or a cross-platform execution apparatus of a deep learning model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following briefly describes the technical field to which the disclosed solution relates:
computer Technology can be broadly divided into several areas, Computer system Technology, Computer component Technology, and Computer assembly Technology. The computer technology comprises the following steps: the basic principle of the operation method, the design of an arithmetic unit, an instruction system, the design of a Central Processing Unit (CPU), the pipeline principle, the application of the basic principle in the CPU design, a storage system, a bus and input and output.
AI (Artificial Intelligence) is the discipline of making computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning); it involves technologies at both the hardware level and the software level. Artificial intelligence software techniques generally include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge-graph techniques.
DL (Deep Learning) learns the intrinsic regularities and representation hierarchies of sample data, and the information obtained during such learning greatly helps the interpretation of data such as text, images, and sound. Its ultimate goal is to give machines human-like abilities to analyze and learn, recognizing data such as characters, images, and sounds. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed those of earlier related techniques.
The Internet of Things (IoT) connects, in real time and through information sensors, radio-frequency identification, global positioning systems, infrared sensors, laser scanners, and other devices and technologies, any object or process that needs to be monitored, connected, or interacted with, collecting information about sound, light, heat, electricity, mechanics, chemistry, biology, location, and the like. Through every possible network access it realizes ubiquitous connections between things and between things and people, so as to achieve intelligent sensing, identification, and management of objects and processes. The Internet of Things is an information carrier based on the Internet, traditional telecommunication networks, and the like; it lets all ordinary physical objects that can be independently addressed form an interconnected network.
The following describes a cross-platform execution method and apparatus of a deep learning model according to an embodiment of the present disclosure, and an electronic device with reference to the drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. It should be noted that the execution subject of the cross-platform execution method of the deep learning model in the embodiment of the present disclosure is a cross-platform execution device of the deep learning model, and the cross-platform execution device of the deep learning model may specifically be a hardware device, or software in the hardware device, or the like. The hardware devices are, for example, terminal devices, servers, and the like. As shown in fig. 1, the cross-platform execution method of the deep learning model provided in this embodiment includes the following steps:
s101, obtaining a deep learning model to be executed, wherein the deep learning model comprises a plurality of code components.
The deep learning model to be executed can be any deep learning model intended to be executed across platforms.
S102, generating an assignment graph according to the deep learning model to be executed, wherein the assignment graph comprises a plurality of nodes, and each node corresponds to one code component.
The assignment graph may be a static single-assignment graph (SSA Graph).
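For illustration only, the following minimal sketch (in Python; the names Node and SSAGraph and the example operators are assumptions of this description, not part of the patent) shows one possible way to represent such an assignment graph, with one node per code component and each output variable written exactly once:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        op_type: str                                  # e.g. "conv2d", "relu"
        inputs: list = field(default_factory=list)    # variable names read
        outputs: list = field(default_factory=list)   # variable names written once (SSA)
        kernel: object = None                         # hardware operator, bound later

    @dataclass
    class SSAGraph:
        nodes: list = field(default_factory=list)

        def add(self, op_type, inputs, outputs):
            node = Node(op_type, list(inputs), list(outputs))
            self.nodes.append(node)
            return node

    # one node per code component of the model to be executed
    g = SSAGraph()
    g.add("conv2d", ["image", "w0"], ["var0"])
    g.add("relu", ["var0"], ["var1"])

The later sketches in this description reuse these hypothetical Node and SSAGraph structures.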
S103, acquiring a hardware operator corresponding to each node, and optimizing the assignment graph according to the hardware operators corresponding to the nodes.
It should be noted that deep learning is gradually being applied on various hardware platforms. However, the hardware differs from platform to platform, e.g., ARM CPUs (Advanced RISC Machine Central Processing Units), X86 CPUs, and NVIDIA GPUs (Graphics Processing Units), and different kinds of hardware may be executed in a mixed manner during a single model execution, such as mixed execution on an X86 CPU and an NVIDIA GPU.
In the related art, inference frameworks usually target only platforms such as servers and lack lightweight deployment capability. In their framework designs, analysis and execution are coupled on many existing platforms, both built on hardware-independent Ops (operators). Meanwhile, in their hardware execution strategies, some support execution on only one kind of hardware, and simple hardware mixed-execution strategies have to be carried out through fixed, hand-written logic.
As a result, existing deep learning inference engines cannot effectively adapt to a variety of platforms and, in particular, cannot meet the deployment requirements of mobile terminals or the Internet of Things.
Therefore, in the cross-platform execution method of the deep learning model provided by the disclosure, the assignment graph is optimized according to the hardware operators corresponding to the nodes, so that mixed execution on multiple kinds of hardware can be driven generically and the method adapts to the integration of different hardware.
S104, executing the hardware operator corresponding to each node according to the optimized assignment graph.
In the embodiment of the present disclosure, after the assignment graph is optimized, the hardware operator corresponding to each node may be executed according to the optimized graph.
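As a minimal sketch (reusing the hypothetical Node/SSAGraph structures above and assuming each node's bound kernel exposes a run method returning a tuple of outputs), executing the optimized graph amounts to running each node's hardware operator as soon as its inputs are available:

    def execute(graph, variables):
        # variables: dict mapping graph-input names to their values
        produced = set(variables)
        pending = list(graph.nodes)
        while pending:
            ready = [n for n in pending if all(v in produced for v in n.inputs)]
            if not ready:
                raise RuntimeError("cycle or missing graph input")
            for node in ready:
                results = node.kernel.run(*(variables[v] for v in node.inputs))
                for name, value in zip(node.outputs, results):
                    variables[name] = value
                    produced.add(name)
                pending.remove(node)
        return variables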
According to the cross-platform execution method of the deep learning model, a deep learning model to be executed is obtained, an assignment graph is generated from it, the hardware operator corresponding to each node is obtained, the assignment graph is optimized according to these operators, and the hardware operator corresponding to each node is then executed according to the optimized graph. By optimizing the assignment graph, state transitions are handled effectively, cross-platform execution of the deep learning model can be scheduled efficiently, the method adapts effectively to a variety of platforms, and the efficiency and reliability of the cross-platform execution process are improved.
It should be noted that there are many possible correspondence situations between nodes and hardware operators: for example, two adjacent nodes may correspond to different hardware operators, redundant hardware operators may exist among the operators corresponding to the nodes, or fusible hardware operators may exist among them. Therefore, in the present disclosure, the assignment graph can be optimized based on these different correspondences between nodes and hardware operators.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. As a possible implementation manner, as shown in fig. 2, on the basis of the foregoing embodiment, the method specifically includes the following steps:
s201, obtaining a deep learning model to be executed, wherein the deep learning model comprises a plurality of code components.
S202, generating an assignment graph according to the deep learning model to be executed, wherein the assignment graph comprises a plurality of nodes, and each node corresponds to one code component.
Steps S201 to S202 are the same as steps S101 to S102, and are not described again here.
S203, acquiring a hardware operator corresponding to each node, and optimizing the assignment graph according to the hardware operators corresponding to the nodes.
The following explains the specific process of optimizing the assignment graph according to the hardware operators corresponding to the nodes, for three cases respectively: two adjacent nodes correspond to different hardware operators; redundant hardware operators exist among the operators corresponding to the nodes; and fusible hardware operators exist among the operators corresponding to the nodes.
As a possible implementation manner, as shown in fig. 3, on the basis of the foregoing embodiment, the specific process of optimizing the assignment graph according to the hardware operators corresponding to the nodes includes the following steps:
s301, if two adjacent nodes respectively correspond to different hardware operators, generating a conversion transfer component according to the hardware operators corresponding to the two adjacent nodes.
S302, adding a conversion transfer component between two adjacent nodes.
As a possible implementation manner, if two adjacent nodes correspond to different hardware operators, type derivation and type conversion are performed according to the hardware operators corresponding to the two adjacent nodes to generate a conversion transfer component. Further, the conversion transfer component may be added between the two adjacent nodes.
For example, as shown in fig. 4, for the operators kernel0 and kernel1, two different types a and b are used for the operator inputs and outputs; var0 represents a variable whose type is not yet set. In this case, optionally, as shown in fig. 5, type derivation may be performed: from the output of kernel0, var0 is derived to have kernel0's output type; when var0 is then passed to the input of kernel1, the two types differ, so a conflict is identified.
Further, as shown in fig. 6, type conversion may be performed by a Type-casting Pass (type conversion pass) or the like. Optionally, after the type conflict of var0 is discovered, a conversion transfer component is generated as a typecast operator; by inserting the typecast operator, var0 is converted into a new variable var1 of the type expected by kernel1, which can then be passed to kernel1 as input. In this way, type matching before and after each hardware operator is guaranteed.
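A minimal sketch of such a pass follows, under the assumption that two type-query hooks exist: var_type(var), the type produced for a variable, and expected_type(node, var), the type the consuming hardware operator requires (both hooks and the variable-naming scheme are hypothetical):

    def insert_typecasts(graph, var_type, expected_type):
        new_nodes = []
        for node in graph.nodes:
            for k, var in enumerate(node.inputs):
                if var_type(var) != expected_type(node, var):
                    cast_var = var + "_cast"          # hypothetical SSA naming
                    # the conversion transfer component runs before the consumer
                    new_nodes.append(Node("typecast", [var], [cast_var]))
                    node.inputs[k] = cast_var
            new_nodes.append(node)
        graph.nodes = new_nodes

Because every typecast node is inserted immediately before its consumer, the rewritten graph stays in a valid topological order.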
As a possible implementation manner, if a redundant hardware operator exists in the hardware operators corresponding to the plurality of nodes, the nodes corresponding to the redundant hardware operator in the plurality of nodes are removed.
It should be noted that the present disclosure does not limit the specific manner of removing the nodes corresponding to redundant hardware operators, which may be chosen according to the actual situation. Optionally, redundant hardware operators generated during quantization training can be removed by a Quant-Dequant Pass (quantization/dequantization elimination).
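A hedged sketch of such an elimination (the quantize/dequantize op names and the single-consumer matching rule are deliberately simplified assumptions):

    def remove_quant_dequant(graph):
        dead = set()
        for node in graph.nodes:
            if node.op_type != "quantize" or id(node) in dead:
                continue
            readers = [n for n in graph.nodes
                       if any(v in node.outputs for v in n.inputs)]
            if len(readers) == 1 and readers[0].op_type == "dequantize":
                deq = readers[0]
                # bypass the pair: readers of the dequantized value now read
                # the original (pre-quantization) variable instead
                for n in graph.nodes:
                    n.inputs = [node.inputs[0] if v in deq.outputs else v
                                for v in n.inputs]
                dead |= {id(node), id(deq)}
        graph.nodes = [n for n in graph.nodes if id(n) not in dead]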
As a possible implementation manner, if a fusible hardware operator exists in the hardware operators corresponding to the nodes, nodes corresponding to the fusible hardware operator in the nodes are fused.
It should be noted that the present disclosure does not limit the specific manner of fusing the nodes corresponding to fusible hardware operators, which may be chosen according to the actual situation. Optionally, fusible hardware operators may be fused by an Op Fusion Pass (operator fusion).
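A minimal sketch using one hard-coded pattern, a conv2d followed by a relu that is its sole consumer (the actual fusible patterns are not specified by the patent and are an assumption here):

    def fuse_conv_relu(graph):
        fused, consumed = [], set()
        for node in graph.nodes:
            if id(node) in consumed:
                continue
            readers = [n for n in graph.nodes
                       if any(v in node.outputs for v in n.inputs)]
            if (node.op_type == "conv2d" and len(readers) == 1
                    and readers[0].op_type == "relu"):
                relu = readers[0]
                # one fused node replaces the two original nodes
                fused.append(Node("conv2d_relu", node.inputs, relu.outputs))
                consumed.add(id(relu))
            else:
                fused.append(node)
        graph.nodes = fused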
It should be noted that, in the present disclosure, when optimizing the assignment graph according to the hardware operators corresponding to the nodes, nodes to be extracted may also be removed from the plurality of nodes.
As a possible implementation manner, as shown in fig. 7, on the basis of the foregoing embodiment, the specific process of optimizing the assignment graph according to the hardware operators corresponding to the nodes includes the following steps:
s701, obtaining a code component to be extracted to a third-party execution engine.
S702, determining a corresponding node to be extracted according to the code component to be extracted.
And S703, removing the nodes to be extracted from the plurality of nodes.
It should be noted that the present disclosure does not limit the specific manner of removing the nodes to be extracted, which may be chosen according to the actual situation. Optionally, the nodes to be extracted may be removed by a Subgraph Detection Pass (subgraph detection).
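A hedged sketch of the extraction, assuming a capability query handled_by_engine(node) that marks the nodes whose code components should be handed over to the third-party engine (the query and the engine_subgraph op name are hypothetical):

    def extract_subgraph(graph, handled_by_engine):
        sub = [n for n in graph.nodes if handled_by_engine(n)]
        if not sub:
            return
        sub_ids = {id(n) for n in sub}
        internal = {v for n in sub for v in n.outputs}
        # values the engine must receive from the remaining graph
        feeds = [v for n in sub for v in n.inputs if v not in internal]
        # values the remaining graph still reads back from the engine
        fetches = [v for v in internal
                   if any(v in n.inputs for n in graph.nodes
                          if id(n) not in sub_ids)]
        graph.nodes = [n for n in graph.nodes if id(n) not in sub_ids]
        # placement simplified: a real pass would keep topological order
        graph.nodes.append(Node("engine_subgraph", feeds, fetches))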
In the present disclosure, when optimizing the assignment graph according to the hardware operators corresponding to the nodes, inconsistent nodes may also be removed from the plurality of nodes.
As a possible implementation manner, as shown in fig. 8, on the basis of the foregoing embodiment, the specific process of optimizing the assignment graph according to the hardware operators corresponding to the nodes includes the following steps:
s801, obtaining a target deep learning model, wherein the target deep learning model comprises a plurality of target code components.
S802, determining code components which are inconsistent with the target code components in the code components.
And S803, determining the corresponding inconsistent nodes according to the inconsistent code components.
And S804, removing inconsistent nodes in the plurality of nodes.
It should be noted that the present disclosure does not limit the specific manner of removing inconsistent nodes, which may be chosen according to the actual situation. Optionally, a Kernel Pick Pass (kernel selection) may select a specific hardware operator so as to remove the inconsistent nodes from the plurality of nodes.
Further, the conversion transfer components may be multiplexed after the assignment graph is optimized according to the hardware operators corresponding to the nodes.
It should be noted that the present disclosure does not limit the specific manner of multiplexing the conversion transfer components, which may be chosen according to the actual situation. Optionally, the conversion transfer components may be reused by a Memory Optimization Pass (memory optimization), so as to reduce overall memory consumption.
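A minimal sketch of such multiplexing, assuming two hypothetical hooks: type_pair(node), which returns the (source type, target type) of a conversion transfer node, and make_typecast_kernel, which builds the shared component; typecast nodes with identical type pairs then share one kernel instance instead of each allocating its own:

    def multiplex_typecasts(graph, type_pair, make_typecast_kernel):
        pool = {}
        for node in graph.nodes:
            if node.op_type == "typecast":
                key = type_pair(node)
                # reuse one conversion transfer component per type pair
                node.kernel = pool.setdefault(key, make_typecast_kernel(key))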
S204, executing the hardware operator corresponding to each node according to the optimized assignment graph.
Step S204 is the same as step S104, and is not described herein again.
According to the embodiment of the present disclosure, the assignment graph can be optimized based on the different correspondences between nodes and hardware operators, so that mixed execution on multiple kinds of hardware can be driven generically; the method adapts to the integration of different hardware, realizes efficient and reliable scheduling for deep learning inference, and further improves the efficiency and reliability of the cross-platform execution process of the deep learning model.
It should be noted that, in the present disclosure, the overall process of cross-platform execution of the deep learning model may be divided into five parts, as shown in fig. 9. The first to fourth parts constitute the analysis stage, and the fifth part is the execution stage.
The first to third parts may be performed offline, in which case only the fifth part is ultimately needed at deployment time. Therefore, the cross-platform execution method of the deep learning model provided by the disclosure strictly controls the complexity of the modules involved at execution time, so the deployment volume can be compressed and framework-unrelated performance consumption is avoided.
In the third part, by enriching the model-analysis capability, the relevant characteristics of the model and the hardware can be correspondingly analyzed and optimized.
The fifth part can drive multi-hardware mixed execution through an abstract execution model. Therefore, the cross-platform execution method of the deep learning model provided by the disclosure is generically applicable to the integration of different hardware. Meanwhile, for the fifth part, pluggable support can be provided in the build system by splitting the code directories of different hardware, so as to avoid unnecessary interference between the development and deployment of different hardware.
In view of the above, the cross-platform execution method of the deep learning model provided by the disclosure has the advantages of multi-hardware scheduling, lightweight deployment, high hardware-integration capability, and the like. These outstanding advantages are explained below.
Regarding multi-hardware scheduling, it should be noted that in deep learning inference scenarios mixed execution of multiple kinds of hardware is the norm: for example, a CPU mixed with a GPU on the server side, or cross execution of an ARM CPU and a mobile GPU on a mobile phone. For the fifth part of fig. 9, execution may be carried out by any combination of hardware together. When multiple kinds of hardware are mixed, the execution stage of the inference engine calls operators on different hardware alternately, which brings complex state transitions of data locations; for example, data on a CPU must be transferred to a GPU before the corresponding GPU computation can be invoked.
Similar to changes of data location, state transitions also occur in other aspects, such as the quantization precision of computation (Precision) and the data layout (Data Layout), e.g., the analyses contained in the fourth part of fig. 9.
In the embodiment of the present disclosure, in order to handle these state transitions, the concept of a Type System is borrowed from compilers: the third part in fig. 9 abstracts hardware location, quantization precision, data layout, and the like into types; state transitions are represented as type conversions; all representations and analysis optimizations are built on a computation graph with strongly typed representations; and a corresponding infrastructure for analysis and optimization is provided.
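As an illustrative sketch of this abstraction (the names are assumptions), hardware location, quantization precision, and data layout can be folded into one immutable type, so that every state transition reduces to a single uniform type-conversion check:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Place:
        target: str     # e.g. "x86", "arm", "cuda"
        precision: str  # e.g. "float32", "int8", "int16"
        layout: str     # e.g. "NCHW", "NHWC"

    def needs_transition(src: Place, dst: Place) -> bool:
        # a device copy, a re-quantization, and a re-layout are all the same
        # kind of event: a type conversion handled by a conversion transfer
        # component inserted into the graph
        return src != dst

    # moving data from an x86 CPU to a CUDA GPU is one such type conversion
    assert needs_transition(Place("x86", "float32", "NCHW"),
                            Place("cuda", "float32", "NCHW"))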
Therefore, the cross-platform execution method of the deep learning model provided by the disclosure can generically schedule execution on any number of hardware devices, while integrally supporting mixed execution across quantization precisions, data layouts, and the like.
Regarding lightweight deployment, devices on mobile terminals and in the Internet of Things are often weaker than servers in memory, disk, and computing power, which requires the inference engine to control its resource occupation accordingly.
Therefore, the cross-platform execution method of the deep learning model provided by the disclosure strictly splits the analysis stage from the execution stage, so that the modules of the execution stage can be deployed independently, compressing the occupation of irrelevant resources in the final deployment stage. Further, in addition to the lightweight inference library itself, the disclosure supports low-loss compression of the model; for example, Int16 precision directly removes half of the model volume, greatly reducing disk and memory consumption.
Regarding high hardware-integration capability, it should be noted that in recent years hardware platforms have developed vigorously: there are more and more kinds of hardware dedicated to inference, and different hardware has different calling behaviors, which poses great challenges to the software layer of an inference engine. In order to support more hardware back ends generically, the cross-platform execution method of the deep learning model provided by the disclosure does not require all hardware to expose the same calling API (Application Programming Interface); instead, it opens a few necessary interfaces to the hardware back end, so that the back end can customize the behavior of the framework at both the upper and lower layers and fully exploit the capabilities of the hardware.
These interfaces focus on macroscopic graph analysis, such as operator fusion, and on interfaces for transmitting operator information; any hardware back end can obtain the information for its own execution through graph analysis.
In summary, the cross-platform execution method of the deep learning model provided by the disclosure can be applied to various platforms, can execute on various hardware, and can exploit the characteristics of each hardware platform for efficient execution; it supports mixed execution of multiple kinds of hardware on one platform; its runtime resource footprint is small enough to support execution on terminals such as mobile phones; and the deployed library and model are small enough to execute in scenarios such as the Internet of Things.
Corresponding to the cross-platform execution methods of the deep learning model provided in the foregoing embodiments, an embodiment of the present disclosure further provides a cross-platform execution device of the deep learning model. Since the device corresponds to the methods provided in the foregoing embodiments, the implementation manners of the method are also applicable to the device and are not described in detail in this embodiment.
FIG. 10 is a schematic structural diagram of a cross-platform execution device of a deep learning model according to one embodiment of the present disclosure.
As shown in fig. 10, the cross-platform execution device 1000 of the deep learning model includes: an acquisition module 1010, a generation module 1020, an optimization module 1030, and an execution module 1040. Wherein:
an acquisition module 1010, configured to obtain a deep learning model to be executed, wherein the deep learning model includes a plurality of code components;
a generating module 1020, configured to generate an assignment graph according to the deep learning model to be executed, wherein the assignment graph includes a plurality of nodes, and each node corresponds to one code component;
an optimization module 1030, configured to obtain a hardware operator corresponding to each node, and optimize the assignment graph according to the hardware operators corresponding to the nodes;
an execution module 1040, configured to execute, according to the optimized assignment graph, the hardware operator corresponding to each node.
Wherein, the optimization module 1030 is further configured to: if fusible hardware operators exist among the hardware operators corresponding to the nodes, fuse the nodes corresponding to the fusible hardware operators.
Wherein, the optimization module 1030 is further configured to: acquire the code component to be extracted to a third-party execution engine; determine the corresponding node to be extracted according to the code component to be extracted; and remove the node to be extracted from the plurality of nodes.
Wherein, the optimization module 1030 is further configured to: obtain a target deep learning model, wherein the target deep learning model includes a plurality of target code components; determine the code components among the plurality of code components that are inconsistent with the plurality of target code components; determine the corresponding inconsistent nodes according to the inconsistent code components; and remove the inconsistent nodes from the plurality of nodes.
Wherein, the optimization module 1030 is further configured to: multiplex the conversion transfer components.
According to the cross-platform execution device of the embodiment of the present disclosure, a deep learning model to be executed can be obtained, an assignment graph is generated from it, the hardware operator corresponding to each node is obtained, the assignment graph is optimized according to these operators, and the hardware operator corresponding to each node is then executed according to the optimized graph. By optimizing the assignment graph, state transitions are handled effectively, cross-platform execution of the deep learning model can be scheduled efficiently and adapted effectively to a variety of platforms, and the efficiency and reliability of the cross-platform execution process are improved.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 11 shows a schematic block diagram of an example electronic device 1100 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the device 1100 comprises a computing unit 1101, which may perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. The RAM 1103 may also store various programs and data necessary for the operation of the device 1100. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
A number of components in device 1100 connect to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, and the like; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108 such as a magnetic disk, optical disk, or the like; and a communication unit 1109 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 can be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1101 performs the various methods and processes described above, such as the cross-platform execution method of the deep learning model. For example, in some embodiments, the cross-platform execution method of the deep learning model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the cross-platform execution method of the deep learning model described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured by any other suitable means (e.g., by means of firmware) to execute the cross-platform execution method of the deep learning model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable cross-platform execution apparatus of a deep learning model, such that the program codes, when executed by the processor or controller, cause the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server (also called a cloud computing server or cloud host), a host product in the cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A cross-platform execution method of a deep learning model comprises the following steps:
obtaining a deep learning model to be executed, wherein the deep learning model comprises a plurality of code components;
generating an assignment graph according to the deep learning model to be executed, wherein the assignment graph comprises a plurality of nodes, and each node corresponds to one code component;
acquiring a hardware operator corresponding to each node, and optimizing the assignment graph according to the hardware operators corresponding to the nodes;
executing the hardware operator corresponding to each node according to the optimized assignment graph;
wherein the optimizing of the assignment graph according to the hardware operators corresponding to the nodes includes:
if two adjacent nodes correspond to different hardware operators, generating a conversion transfer component according to the hardware operators corresponding to the two adjacent nodes;
adding the conversion transfer component between the two adjacent nodes;
and the generating of the conversion transfer component according to the hardware operators corresponding to the two adjacent nodes comprises:
performing type derivation and type conversion according to the hardware operators corresponding to the two adjacent nodes to generate the conversion transfer component.
2. The method of claim 1, wherein the optimizing of the assignment graph according to the hardware operators corresponding to the nodes comprises:
and if redundant hardware operators exist in the hardware operators corresponding to the nodes, removing the nodes corresponding to the redundant hardware operators in the nodes.
3. The method of claim 1, wherein the optimizing of the assignment graph according to the hardware operators corresponding to the nodes comprises:
and if the fusible hardware operator exists in the hardware operators corresponding to the nodes, fusing the nodes corresponding to the fusible hardware operator in the nodes.
4. The method of claim 1, wherein the optimizing of the assignment graph according to the hardware operators corresponding to the nodes comprises:
acquiring the code component to be extracted to a third-party execution engine;
determining the corresponding node to be extracted according to the code component to be extracted;
removing the node to be extracted from the plurality of nodes.
5. The method of claim 1, wherein the optimizing of the assignment graph according to the hardware operators corresponding to the nodes comprises:
obtaining a target deep learning model, wherein the target deep learning model comprises a plurality of target code components;
determining the code component of the plurality of code components that is inconsistent with the plurality of target code components;
determining the corresponding inconsistent nodes according to the inconsistent code components;
removing the inconsistent nodes from the plurality of nodes.
6. The method of claim 1, wherein the optimizing of the assignment graph according to the hardware operators corresponding to the nodes further comprises:
multiplexing the conversion transfer components.
7. A cross-platform execution device of a deep learning model, comprising:
an acquisition module, configured to obtain a deep learning model to be executed, wherein the deep learning model comprises a plurality of code components;
a generating module, configured to generate an assignment graph according to the deep learning model to be executed, wherein the assignment graph comprises a plurality of nodes, and each node corresponds to one code component;
an optimization module, configured to obtain a hardware operator corresponding to each node and optimize the assignment graph according to the hardware operators corresponding to the nodes;
an execution module, configured to execute the hardware operator corresponding to each node according to the optimized assignment graph;
wherein the optimization module is further configured to:
if two adjacent nodes correspond to different hardware operators, generate a conversion transfer component according to the hardware operators corresponding to the two adjacent nodes;
add the conversion transfer component between the two adjacent nodes;
wherein the generating of the conversion transfer component according to the hardware operators corresponding to the two adjacent nodes comprises:
performing type derivation and type conversion according to the hardware operators corresponding to the two adjacent nodes to generate the conversion transfer component.
8. The apparatus of claim 7, wherein the optimization module is further configured to:
and if redundant hardware operators exist in the hardware operators corresponding to the nodes, removing the nodes corresponding to the redundant hardware operators in the nodes.
9. The apparatus of claim 7, wherein the optimization module is further configured to:
and if the fusible hardware operator exists in the hardware operators corresponding to the nodes, fusing the nodes corresponding to the fusible hardware operator in the nodes.
10. The apparatus of claim 7, wherein the optimization module is further configured to:
acquiring the code component to be extracted to a third-party execution engine;
determining the corresponding node to be extracted according to the code component to be extracted;
removing the node to be extracted from the plurality of nodes.
11. The apparatus of claim 7, wherein the optimization module is further configured to:
obtaining a target deep learning model, wherein the target deep learning model comprises a plurality of target code components;
determining the code component of the plurality of code components that is inconsistent with the plurality of target code components;
determining the corresponding inconsistent nodes according to the inconsistent code components;
removing the inconsistent nodes from the plurality of nodes.
12. The apparatus of claim 7, wherein the optimization module is further configured to:
multiplexing the translation passing components.
13. A deep learning inference engine comprising: the cross-platform execution device of the deep learning model of any one of claims 7-12.
14. An electronic device comprising a processor and a memory;
wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for implementing the cross-platform execution method of the deep learning model according to any one of claims 1 to 6.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a cross-platform execution method of a deep learning model according to any one of claims 1 to 6.
CN202011598792.9A 2020-12-29 2020-12-29 Cross-platform execution method and device of deep learning model and electronic equipment Active CN112633502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011598792.9A CN112633502B (en) 2020-12-29 2020-12-29 Cross-platform execution method and device of deep learning model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011598792.9A CN112633502B (en) 2020-12-29 2020-12-29 Cross-platform execution method and device of deep learning model and electronic equipment

Publications (2)

Publication Number Publication Date
CN112633502A (en) 2021-04-09
CN112633502B (en) 2022-03-22

Family

ID=75287578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011598792.9A Active CN112633502B (en) 2020-12-29 2020-12-29 Cross-platform execution method and device of deep learning model and electronic equipment

Country Status (1)

Country Link
CN (1) CN112633502B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186678B (en) * 2021-12-10 2023-04-07 北京百度网讯科技有限公司 Hardware adaptation device and method based on deep learning
CN115081628B (en) * 2022-08-15 2022-12-09 浙江大华技术股份有限公司 Method and device for determining adaptation degree of deep learning model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107621934A (en) * 2017-07-28 2018-01-23 中国人民解放军国防信息学院 Based on modularization, the evaluation index computational methods of graphical operator and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3502935A1 (en) * 2017-12-20 2019-06-26 Koninklijke Philips N.V. Compiling device and method
US10803257B2 (en) * 2018-03-22 2020-10-13 Microsoft Technology Licensing, Llc Machine translation locking using sequence-based lock/unlock classification
CN111291882A (en) * 2018-12-06 2020-06-16 北京百度网讯科技有限公司 Model conversion method, device, equipment and computer storage medium
CN109740735B (en) * 2018-12-29 2020-12-29 百度在线网络技术(北京)有限公司 Multi-neural-network output method and device, server and computer readable medium
CN110378413A (en) * 2019-07-17 2019-10-25 Oppo广东移动通信有限公司 Neural network model processing method, device and electronic equipment
CN111401511A (en) * 2019-09-24 2020-07-10 上海寒武纪信息科技有限公司 Data processing method and device, computer equipment and storage medium
CN110659728B (en) * 2019-09-24 2024-03-05 安徽寒武纪信息科技有限公司 Neural network optimization method, device, computer equipment and storage medium
CN111967568B (en) * 2020-06-29 2023-09-01 北京百度网讯科技有限公司 Adaptation method and device for deep learning model and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107621934A (en) * 2017-07-28 2018-01-23 中国人民解放军国防信息学院 Based on modularization, the evaluation index computational methods of graphical operator and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on standardization of artificial intelligence operator interfaces (人工智能算子接口标准化研究); Li Ziyi et al.; Artificial Intelligence (《人工智能》); 2020-06-10 (No. 03); full text *

Also Published As

Publication number Publication date
CN112633502A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN112633502B (en) Cross-platform execution method and device of deep learning model and electronic equipment
CN113159091A (en) Data processing method and device, electronic equipment and storage medium
US11934287B2 (en) Method, electronic device and computer program product for processing data
CN112925587A (en) Method and apparatus for initializing applications
CN112328301B (en) Method and device for maintaining consistency of operating environments, storage medium and electronic equipment
CN113139660A (en) Model reasoning method and device, electronic equipment and storage medium
CN112540767A (en) Program code generation method, program code generation device, electronic device and storage medium
CN113344214B (en) Training method and device of data processing model, electronic equipment and storage medium
CN113554180B (en) Information prediction method, information prediction device, electronic equipment and storage medium
CN112783508B (en) File compiling method, device, equipment and storage medium
CN114417780A (en) State synchronization method and device, electronic equipment and storage medium
CN112817660A (en) Method, device, equipment and storage medium for expanding small program capacity
CN114897664B (en) Graph model deployment method and device, GPU and storage medium
CN114756211B (en) Model training method and device, electronic equipment and storage medium
CN113590217B (en) Function management method and device based on engine, electronic equipment and storage medium
CN115794742A (en) File path data processing method, device, equipment and storage medium
CN114819095A (en) Method and device for generating business data processing model and electronic equipment
CN110175769B (en) Asset value evaluation method, device and system based on micro-service architecture
CN113157281B (en) Development environment creation method, apparatus, electronic device and storage medium
US11772681B2 (en) Method and apparatus for processing autonomous driving simulation data, and electronic device
CN113900734A (en) Application program file configuration method, device, equipment and storage medium
CN114997329A (en) Method, apparatus, device, medium and product for generating a model
CN115469887A (en) Method and device for issuing cloud native application, electronic equipment and storage medium
CN116756061A (en) External equipment adaptation method, device, equipment and storage medium
CN116932147A (en) Streaming job processing method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant