CN117762588A - Method for scheduling Kernel in neural network and related products - Google Patents

Method for scheduling Kernel in neural network and related products

Info

Publication number
CN117762588A
CN117762588A (application number CN202311769986.4A)
Authority
CN
China
Prior art keywords: kernel, scale, existing, performance, preferred
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311769986.4A
Other languages
Chinese (zh)
Inventor
Request not to publish the inventor's name
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202311769986.4A priority Critical patent/CN117762588A/en
Publication of CN117762588A publication Critical patent/CN117762588A/en
Pending legal-status Critical Current

Abstract

The present disclosure provides a method for scheduling Kernel in a neural network and related products. The related computing device may be included in a combined processing device, which may also include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete the computing operation specified by the user. The combined processing device may further include a storage device, connected to the computing device and the other processing devices respectively, for storing data of the computing device and the other processing devices.

Description

Method for scheduling Kernel in neural network and related products
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to the selection and scheduling of Kernel in neural networks.
Background
Deep learning frameworks typically need to support a variety of hardware platforms, which means the same deep learning algorithm may need to be implemented on different hardware platforms. Even on the same hardware platform, an algorithm often has a number of different implementations (referred to as Kernel in this application) in order to achieve the desired performance, and different Kernel are typically designed to handle different scales. However, even for the same scale, the performance of different Kernel may differ greatly. In particular, where the scales covered by different Kernel overlap, if the best-performing Kernel cannot be selected, the performance of the hardware cannot be fully exploited, the algorithm performs poorly, and the performance of the whole network suffers as a result.
Therefore, how to select the Kernel with better performance for a given scale is a problem to be solved.
Disclosure of Invention
One object of the present disclosure is to select Kernel with better performance for the scale of neural networks to fully exploit the performance of the hardware.
According to a first aspect of the present disclosure, there is provided a method of scheduling Kernel in a neural network, comprising: determining a current scale of the neural network; selecting a preferred Kernel from a plurality of Kernels, wherein the preferred Kernel has matching performance on the current scale; and running the preferred Kernel through a computing interface, thereby realizing the scheduling of Kernel.
According to a second aspect of the present disclosure, there is provided a system for scheduling Kernel in a neural network, comprising: a controller configured to determine a current scale of the neural network; a query interface configured to select a preferred Kernel having matching performance for the current scale from a plurality of kernels; and a computing interface configured to run the preferred Kernel through the computing interface, thereby enabling scheduling of Kernel.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a fourth aspect of the present disclosure there is provided a computer readable storage medium comprising computer executable instructions which, when executed by one or more processors, perform a method as described above.
By the method provided by the disclosure, more suitable Kernel can be selected for the neural networks with different scales, so that the system operation efficiency is remarkably improved.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 shows a schematic structural view of a board of an embodiment of the present disclosure;
FIG. 2 is a schematic diagram showing a combination processing apparatus;
FIG. 3 illustrates an internal structural schematic of a computing device;
FIG. 4 illustrates the internal architecture of a processor;
FIG. 5 illustrates a flow chart of a method of scheduling Kernel in a neural network, according to one embodiment of the present disclosure;
FIG. 6 illustrates a situation where existing scales overlap according to one embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of a system for dynamically scheduling Kernel in a neural network, according to one embodiment of the present disclosure;
FIG. 8 shows a flow diagram of selecting, from a plurality of Kernel, a preferred Kernel having matching performance for the current scale in a heuristic search mode; and
FIG. 9 shows a schematic flow chart of selecting, from a plurality of Kernel, a preferred Kernel having matching performance for the current scale in a traversal search mode.
Detailed Description
The following description of the embodiments of the present disclosure is made clearly and fully with reference to the accompanying drawings; evidently, the embodiments described are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that may be made by those skilled in the art without inventive effort fall within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification, and drawings of this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. These terms do not necessarily denote a single object and may denote a plurality of objects. The terms "comprises" and "comprising," when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when," "once," "in response to determining," or "in response to detecting," depending on the context. Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining," "in response to determining," "upon detecting [the described condition or event]," or "in response to detecting [the described condition or event]."
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Today's semiconductor fabrication process begins with a complete wafer, a circular slice of pure silicon typically available in 6-inch, 8-inch, 12-inch and other specifications. The wafer is cut into individual pieces called dice (die). On each die a chip is formed and wiring is arranged to perform a specific electrical function. The chip is then packaged, with the die as the unit, into a component; the purpose of packaging is to place, fix, seal, and protect the chip and to enhance its electrothermal performance, while wires connect the chip's contacts to the pins of the package housing, thereby completing the chip package structure.
The memory is used for temporarily storing operation data required by the system on chip and data exchanged with the external memory. In this embodiment, the memory may be a high-bandwidth memory (high bandwidth memory, HBM), which is a high-performance DRAM fabricated based on a 3D stack process, and is suitable for applications requiring high memory bandwidth, such as graphics processors, network switching and forwarding devices (e.g., routers, switches), and the like.
A system on chip (SoC) refers to the technology of integrating a complete system, with all or part of the necessary electronic circuits, onto a single chip. In this embodiment, the system on chip is mounted on a board. Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a combination processing device 101, which is an artificial-intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; one notable characteristic of cloud intelligence applications is the large volume of input data, which places high requirements on the storage capacity and computing capacity of the platform.
The combination processing apparatus 101 is connected to an external device 103 via an external interface apparatus 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the combination processing apparatus 101 through the external interface apparatus 102. The calculation result of the combination processing means 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes an external memory 104 for storing data, which includes one or more memory units 105. The external memory 104 is connected to the control device 106 and the combination processing apparatus 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the combination processing apparatus 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a schematic diagram showing the combination processing apparatus 101 of this embodiment. As shown in fig. 2, the combination processing device 101 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204. In one application scenario, the computing device 201, the interface device 202, and the processing device 203 are integrated into the aforementioned system-on-chip. In another application scenario, the computing device 201 itself is the aforementioned system-on-chip.
The computing device 201 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively accomplish the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors, including but not limited to a central processing unit (CPU), a graphics processor (GPU), or another general-purpose and/or special-purpose processor such as a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and the number of processors may be determined according to actual needs. As previously mentioned, when considered on its own, the computing device 201 of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The DRAM 204 is the aforementioned high-bandwidth memory, typically 16 GB or more in capacity, and is used to store the data to be processed by the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 is configured to process input data such as computer vision, voice, natural language, and data mining, and the computing device 201 in the figure adopts a multi-core hierarchical structure design, which includes an external memory controller 301, a peripheral communication module 302, an on-chip interconnection module 303, a synchronization module 304, and a plurality of clusters 305.
There may be a plurality of external memory controllers 301, two being shown by way of example, for accessing external memory devices such as the DRAM 204 in FIG. 2, so as to read data from or write data to off-chip memory in response to an access request issued by a processor core. The peripheral communication module 302 is configured to receive a control signal from the processing device 203 through the interface device 202 and activate the computing device 201 to perform a task. The on-chip interconnect module 303 connects the external memory controller 301, the peripheral communication module 302, and the plurality of clusters 305, and transfers data and control signals between the modules. The synchronization module 304 is a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each cluster to ensure synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201, four being illustratively shown; as hardware progresses, the computing device 201 of the present disclosure may also include 8, 16, 64, or even more clusters 305. The cluster 305 is used to efficiently execute the deep learning algorithm.
Each cluster 305 includes a plurality of processor cores (IPU cores) 306 and a memory core (MEM core) 307.
The processor cores 306 are illustratively shown as 4 in the figures, and the present disclosure does not limit the number of processor cores 306. The internal architecture is shown in fig. 4. Each processor core 306 includes three major modules: a control module 41, an operation module 42 and a storage module 43.
The control module 41 is used for coordinating and controlling the operation of the operation module 42 and the storage module 43 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 411 and an instruction decode unit (instruction decode unit, IDU) 412. The instruction fetching unit 411 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 412 decodes the fetched instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation, etc.; the matrix operation unit 422 is responsible for the core computation of the deep learning algorithm, i.e. matrix multiplication and convolution.
The storage module 43 is used for storing or handling related data, including a neuron storage unit (NRAM) 431, a weight storage unit (WRAM) 432, an input/output direct memory access module (input/output direct memory access, IODMA) 433, and a handling direct memory access module (move direct memory access, MVDMA) 434. NRAM 431 is used to store input data, output data, and intermediate results for computation by the processor core 306; WRAM 432 is configured to store weights for the deep learning network; the IODMA 433 controls access between the NRAM 431/WRAM 432 and the DRAM 204 via the broadcast bus 309; MVDMA 434 is used to control access between the NRAM 431/WRAM 432 and the SRAM 308.
Returning to FIG. 3, the memory cores 307 are primarily used for storage and communication, i.e., to store shared data or intermediate results between the processor cores 306, and to perform communication between a cluster 305 and the DRAM 204, between clusters 305, between processor cores 306, and so on. In other embodiments, the memory core 307 has scalar operation capabilities to perform scalar operations.
The memory core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (cluster direct memory access, CDMA) 310, and a global direct memory access module (global direct memory access, GDMA) 311. The SRAM 308 acts as a high-performance data transfer station: data multiplexed between different processor cores 306 in the same cluster 305 does not need to be fetched by each processor core 306 from the DRAM 204 separately, but is relayed between the processor cores 306 through the SRAM 308. The memory core 307 only needs to quickly distribute the multiplexed data from the SRAM 308 to the processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, the CDMA 310, and the GDMA 311 are used to perform communication between processor cores 306, communication between clusters 305, and data transfer between a cluster 305 and the DRAM 204, respectively. Each is described below.
The broadcast bus 309 is used to perform high-speed communication between the processor cores 306 in the cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast refers to point-to-point transmission of data (i.e., from a single processor core to a single processor core); multicast is a communication scheme that transfers a piece of data from the SRAM 308 to a specific number of processor cores 306; and broadcast, a special case of multicast, is a communication scheme that transfers a piece of data from the SRAM 308 to all processor cores 306.
CDMA 310 is used to control access to SRAM 308 between different clusters 305 within the same computing device 201. The GDMA 311 cooperates with the external memory controller 301 to control access of the SRAM 308 of the cluster 305 to the DRAM 204 or to read data from the DRAM 204 into the SRAM 308.
Specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 5 shows a flow chart of a method of scheduling Kernel in a neural network according to one embodiment of the present disclosure. As shown in fig. 5, the method includes: in operation S510, determining a current scale of the neural network; selecting a preferred Kernel from a plurality of Kernels having matching performance for the current scale in operation S520; and running the preferred Kernel through a computing interface to implement Kernel scheduling in operation S530.
The scale of a neural network is determined by various factors, such as the number of layers of the neural network and the connection manner between neurons (sparse or dense connection). To improve the performance of neural networks of different scales, different Kernel can be designed for different scales so as to fully exploit the performance of the hardware.
Thus, once the system has determined the scale of a neural network, the Kernel best adapted to the current scale may be selected from a plurality of Kernel, so that the adapted Kernel can deliver better performance.
The appropriate Kernel can be selected in a static manner or in a dynamic manner. The static manner is described first.
According to one embodiment of the present disclosure, the method of the present disclosure further comprises: establishing a lookup table to facilitate selection, from a plurality of Kernel, of a preferred Kernel having matching performance for the current scale, wherein the lookup table includes an existing scale identifier for identifying an existing scale of the neural network and a Kernel identifier for identifying a Kernel for the existing scale.
First, the terms "existing scale" and "current scale" used above are explained. "Existing scale" refers to a scale that is already known and can be used to build the lookup table or file, while "current scale" refers to a newly encountered scale. In this disclosure, the newly encountered "current scale" needs to be evaluated in order to select the appropriate Kernel.
Specifically, first, evaluation indexes of Kernel performance, such as run time, parallelism, additional memory requirements, etc., may be formulated.
It should be understood that the Kernel performance evaluation criterion is not one-dimensional or unitary; multiple evaluation indexes may coexist. For example, some users may care most about how fast a Kernel runs, in which case run time becomes the primary evaluation index. In another case, for example when the external hardware configuration is limited (e.g., memory is small), running speed may not matter most; what matters most is that the Kernel runs stably under the limited hardware configuration, so the additional memory requirement may become the primary evaluation index.
It should be understood that the above lists only some examples of performance indexes and is not exhaustive. Those skilled in the art can set different performance indexes according to their own needs.
Next, all Kernel implementations are traversed for the existing scales, and the evaluation index of each Kernel for each scale is recorded. Then, according to one embodiment of the present disclosure, the Kernel identifiers in the lookup table may be ranked according to Kernel performance; that is, the Kernel are sorted by the evaluation index of interest, a preferred Kernel is selected, and the result is stored in a specific lookup table or file. The Kernel in the lookup table or file may be organized as a dictionary, with one {key, value} pair per scale: key is the existing scale identifier, and value is the identifier of the preferred Kernel.
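By way of illustration, a minimal C++ sketch of building such a lookup table is given below. The type names (ScaleKey, KernelId, KernelEval) and the choice of run time as the primary evaluation index are assumptions introduced for this example, not interfaces defined by the present disclosure.
```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <vector>

using ScaleKey = std::uint64_t;  // existing scale identifier (key)
using KernelId = std::string;    // Kernel identifier (value)

// Evaluation record for one Kernel at one existing scale.
struct KernelEval {
    KernelId id;
    double run_time_ms;        // primary evaluation index in this sketch
    double parallelism;        // further indexes may also be recorded
    std::size_t extra_memory;  // additional memory requirement
};

// Traverse the recorded evaluations for every existing scale and keep, per
// scale, the Kernel with the best (lowest) run time as the preferred Kernel,
// producing the {key, value} lookup table described above.
std::map<ScaleKey, KernelId> BuildLookupTable(
    const std::map<ScaleKey, std::vector<KernelEval>>& evals) {
    std::map<ScaleKey, KernelId> table;
    for (const auto& [scale, candidates] : evals) {
        double best = std::numeric_limits<double>::max();
        for (const auto& e : candidates) {
            if (e.run_time_ms < best) {
                best = e.run_time_ms;
                table[scale] = e.id;
            }
        }
    }
    return table;
}
```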
According to one embodiment of the present disclosure, selecting a preferred Kernel from a plurality of Kernel having matching performance for the current scale comprises: searching an optimal Kernel corresponding to the current scale in the lookup table; and if the corresponding optimal Kernel is found, selecting the found optimal Kernel as the optimal Kernel with matching performance for the current scale.
Specifically, the user may call an operator interface to obtain the scale information of the neural network from the host end, and then first check whether the corresponding scale can be hit in the designated lookup table or file, i.e., whether an entry for that scale can be queried. If it can be hit, the corresponding Kernel may be called directly according to the {key, value} pair described above. If it is not hit, a Kernel is selected according to default rules.
The default rules described herein may be defined according to the needs of the user. For example, if the corresponding scale is missed, the Kernel in the lookup table whose scale is closest to the current scale may be used. Alternatively, the Kernel most likely to match the current scale may be estimated based on statistics. Default rules do not necessarily select the best Kernel, but have a high probability of selecting a Kernel that meets the basic requirements.
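A possible sketch of this query-with-fallback logic is shown below, reusing the table built above; the nearest-scale fallback is only one example of a default rule, and all names are illustrative assumptions.
```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using ScaleKey = std::uint64_t;  // existing scale identifier (key)
using KernelId = std::string;    // preferred Kernel identifier (value)

// Query the lookup table for the current scale. On a hit, the preferred
// Kernel is returned directly; on a miss, a default rule is applied -- here,
// the Kernel registered for the nearest existing scale (one possible rule).
KernelId SelectKernel(const std::map<ScaleKey, KernelId>& table,
                      ScaleKey current_scale) {
    if (table.empty()) return {};                // no entries registered yet
    auto hit = table.find(current_scale);
    if (hit != table.end()) return hit->second;  // hit: call it directly

    auto upper = table.lower_bound(current_scale);
    if (upper == table.begin()) return upper->second;           // below all keys
    if (upper == table.end()) return std::prev(upper)->second;  // above all keys
    auto lower = std::prev(upper);
    return (current_scale - lower->first <= upper->first - current_scale)
               ? lower->second
               : upper->second;                  // whichever existing scale is closer
}
```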
According to one embodiment of the present disclosure, the existing scale includes a single existing scale, a plurality of discrete existing scales, or a continuous existing scale range.
It is to be understood that the existing scales described above may be of a single numerical scale or may be of multiple discrete and discontinuous scales. For example, in practical testing, a certain Kernel may exhibit better performance for multiple discrete scales, so that the Kernel can be matched to these discrete scales.
It is also to be understood that the existing scale described above may also be a continuous range. For example, in practical tests, a certain Kernel may have good performance for scales within a certain continuous range, so that the Kernel can be matched to those continuous scales.
According to one embodiment of the present disclosure, overlapping may occur for existing scales of different Kernel.
Fig. 6 illustrates a situation where existing scales overlap according to one embodiment of the present disclosure. As shown in fig. 6, the first Kernel K1 performs well over the range R1 shown in fig. 6, and the second Kernel K2 performs well over the range R2, with a cross-overlapping interval Rc between range R1 and range R2. In this overlapping interval Rc, both Kernel K1 and K2 perform well, in which case either K1 or K2 can be arbitrarily chosen as the target Kernel.
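This overlap can be illustrated with a simple range-based registration, sketched below under the assumption that each Kernel declares a continuous scale range; in the overlapping interval Rc the first matching Kernel is returned, which corresponds to arbitrarily choosing K1 or K2.
```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Each Kernel declares a continuous range of existing scales over which it
// performs well; ranges may overlap, as R1 and R2 do in the interval Rc.
struct KernelRange {
    std::string kernel_id;
    std::uint64_t lo, hi;  // inclusive scale range
};

// Return some Kernel whose registered range covers the scale. In an overlap
// such as Rc, the first match is returned, i.e. either K1 or K2 is an
// acceptable, arbitrary choice.
std::optional<std::string> MatchByRange(const std::vector<KernelRange>& ranges,
                                        std::uint64_t scale) {
    for (const auto& r : ranges)
        if (scale >= r.lo && scale <= r.hi) return r.kernel_id;
    return std::nullopt;  // no registered range covers this scale
}
```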
The above describes the case of selecting the appropriate Kernel in a static manner; the following describes the case of selecting the appropriate Kernel in a dynamic manner.
Fig. 7 shows a schematic diagram of a system 700 for dynamically scheduling Kernel in a neural network, according to one embodiment of the present disclosure.
As shown in fig. 7, the system 700 includes: a controller 710 configured to determine a current scale of a neural network; a query interface 720 configured to select, from a plurality of Kernel, a preferred Kernel having matching performance for the current scale; and a computing interface 730 configured to run the preferred Kernel, thereby implementing Kernel scheduling.
In this embodiment, a dedicated query interface 720 is added, which is specifically responsible for selecting the Kernel that best matches the current scale and sending the selected preferred Kernel to the computing interface for operation. According to the above-described embodiments of the present disclosure, this way of selecting the preferred Kernel requires adding a corresponding interface for the user to invoke.
The ways to dynamically search and select the preferred Kernel can also be divided into two ways, one being heuristic and one being traversal.
Fig. 8 shows a schematic flow chart of selecting a preferred Kernel from a plurality of Kernel with matching performance for the current scale in a heuristic search mode.
As shown in fig. 8, selecting a preferred Kernel from a plurality of Kernel having matching performance for the current scale may include: in operation S810, receiving estimated performance for each Kernel of the plurality of kernels for the existing scale; in operation S820, the estimated performance of the plurality of Kernel is ranked so as to select a preferred Kernel having matching performance for the current scale.
In the heuristic search mode, the Kernel do not need to be actually run; instead, each Kernel can return its estimated performance for different scales according to known parameter information. These estimated performances are then ranked to determine the likely performance of different Kernel at different scales, thereby facilitating the subsequent selection of the Kernel with the best performance. Since no Kernel needs to be actually run on the device side, the heuristic search mode is faster. While the finally chosen preferred Kernel may not be optimal, there is a high probability that it is acceptable and meets the basic requirements.
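A minimal sketch of the heuristic selection is given below; the per-Kernel estimator signature is an assumption, standing in for whatever parameter-based cost model a Kernel exposes.
```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Each candidate exposes an estimator that predicts its performance for a
// given scale from known parameter information, without any device run.
// The estimator signature here is an illustrative assumption.
struct Candidate {
    std::string id;
    double (*estimate_ms)(std::uint64_t scale);  // estimated run time in ms
};

// Rank the estimates in ascending order and take the best-ranked Kernel.
// The result is likely good but, unlike a traversal search, not guaranteed
// to be globally optimal. Assumes a non-empty candidate list.
std::string HeuristicSelect(const std::vector<Candidate>& candidates,
                            std::uint64_t current_scale) {
    std::vector<std::pair<double, std::string>> ranked;
    for (const auto& c : candidates)
        ranked.emplace_back(c.estimate_ms(current_scale), c.id);
    std::sort(ranked.begin(), ranked.end());
    return ranked.front().second;
}
```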
In another way, fig. 9 shows a schematic flow chart of selecting a preferred Kernel with matching performance for the current scale from a plurality of kernels in the traversal search mode.
As shown in fig. 9, selecting a preferred Kernel from a plurality of Kernel having matching performance for the current scale includes: transmitting a plurality of Kernel to the device side to run at the device side in operation S910; in operation S920, receiving performance of Kernel returned from the device side; and ordering the performance of the plurality of Kernel in operation S930 to facilitate selection of a preferred Kernel having matching performance for the current scale.
As described above, in the traversal search mode, multiple or all Kernel are sent to the device end to be actually run, which yields the real performance of each Kernel for different scales as actually executed on the device end.
These real performances are then ranked to determine the actual performance of different Kernel at different scales, thereby facilitating the subsequent selection of the Kernel with the best performance. In the traversal search mode, the selected Kernel is necessarily optimal. However, since all Kernel need to be run on the device end, this takes a long time, and the storage requirement of this approach is also high.
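The traversal mode can be sketched as follows; RunOnDevice() is a placeholder for the device-side launch-and-time step and is stubbed out here, so the snippet illustrates the selection logic only.
```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Placeholder for the device-side step: launch the identified Kernel on the
// device for the given scale and return its measured run time. The real
// implementation is device-specific and is stubbed out here.
double RunOnDevice(const std::string& /*kernel_id*/, std::uint64_t /*scale*/) {
    return 0.0;
}

// Run every candidate, collect the measured times, sort them in ascending
// order, and return the fastest Kernel -- guaranteed best among the
// candidates that were actually run. Assumes a non-empty candidate list.
std::string TraversalSelect(const std::vector<std::string>& kernel_ids,
                            std::uint64_t current_scale) {
    std::vector<std::pair<double, std::string>> measured;
    for (const auto& id : kernel_ids)
        measured.emplace_back(RunOnDevice(id, current_scale), id);
    std::sort(measured.begin(), measured.end());
    return measured.front().second;
}
```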
According to aspects of the present disclosure, multiple interfaces (not just one query interface) may be provided for the user to invoke; multiple candidate Kernel may be run in an interface for a given scale, and the hardware run time of each Kernel may be actually measured. Finally, the Kernel with the best measured performance, together with the corresponding time, can be returned. The user can record the information of this Kernel and call it directly when a neural network of the same scale runs, so that performance is optimal. Such a scheme is applicable to inference scenarios where the scale of the model is relatively fixed.
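A possible usage pattern for such inference scenarios is sketched below: the search result is cached per scale so that later runs of a network with the same scale reuse the recorded Kernel; the cache layout and function names are illustrative assumptions.
```cpp
#include <cstdint>
#include <map>
#include <string>

// Placeholder for the search step (e.g. a traversal search as above) that
// returns the best measured Kernel for a given scale.
std::string SearchBestKernel(std::uint64_t /*scale*/) { return "kernel_stub"; }

// Per-scale cache: the search runs once per scale; later runs of a network
// with the same scale call the recorded Kernel directly.
static std::map<std::uint64_t, std::string> g_best_kernel_cache;

const std::string& BestKernelFor(std::uint64_t scale) {
    auto it = g_best_kernel_cache.find(scale);
    if (it == g_best_kernel_cache.end())
        it = g_best_kernel_cache.emplace(scale, SearchBestKernel(scale)).first;
    return it->second;
}
```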
Likewise, for the dynamic search modes, the existing scale may also include a single existing scale, a plurality of discrete existing scales, or a continuous existing scale range, and the existing scales of different Kernel may overlap.
It is to be understood that the above static and dynamic schemes do not necessarily exist alone, but may be used in combination. For example, for a currently existing set of Kernel, the Kernel may be statically associated with different scales. When a new Kernel is added, the performance of the newly added Kernel may be evaluated in the manner shown in fig. 8 without actually running it, thereby reducing the time consumed in obtaining Kernel performance; alternatively, the new Kernel may be sent to the device end for actual operation to accurately determine at which scale or scales it performs best.
The technical scheme of the present disclosure aims to solve the problem that, when the number of Kernel is large, it is difficult to select a Kernel, so that the better-performing Kernel cannot be hit for some scales. In addition, the technical scheme of the present disclosure can also solve the problem that, when a new Kernel is added, the globally optimal Kernel is difficult to invoke due to Kernel selection, causing performance degradation at some scales.
Specific operational procedures according to the present disclosure are described in more detail below in conjunction with exemplary codes.
1. First, the maximum workspace required for the search is determined through cnnlGetConvolutionForwardWorkspaceSize(), which indicates the maximum storage space required by the search so that the search can run normally.
2. Next, through the cnnlFindConvolutionForwardAlgorithm() or cnnlGetConvolutionForwardAlgorithm() interface, a list of all Kernel available at the specific scale is collected first, and then the time of each Kernel in that list is obtained. An array is then returned, each element of which is a structure cnnlConvolutionFwdAlgoPerf_t containing information such as the kernel's running return status, time, and the workspace_size used. All elements in the array may be arranged in ascending order of run time. According to embodiments of the present disclosure, the search modes of the two interfaces may differ: FindAlgo is a traversal search, which guarantees that the Kernel is globally optimal but takes longer; GetAlgo is a heuristic search, which with high probability yields a relatively good Kernel but takes a relatively short time.
3. Finally, a computing interface such as cnnlConvolutionForward(), cnnlQuantizeConvolutionForward(), cnnlConvolutionForwardInference(), or cnnlFusedOpExecute() can be called. The user passes the required algo (algorithm) into the corresponding computing interface, which then enters the designated algorithm to perform the operation.
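A condensed sketch of steps 2 and 3 is given below. The structure mirrors the cnnlConvolutionFwdAlgoPerf_t fields mentioned above, but the field names, types, and selection logic are assumptions for illustration, not exact library definitions.
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed shape of one element of the returned perf array; the field names
// and types are illustrative, not the exact library definition.
struct ConvolutionFwdAlgoPerf {
    int algo;                    // algorithm / Kernel identifier
    int status;                  // running return status of the kernel
    float time_ms;               // measured or estimated run time
    std::size_t workspace_size;  // workspace used
};

// Sort the perf array in ascending order of time, then pick the fastest
// entry that succeeded and whose workspace fits within the maximum
// workspace obtained in step 1; its algo is what gets passed to the
// computing interface in step 3.
int PickAlgo(std::vector<ConvolutionFwdAlgoPerf> perf, std::size_t max_workspace) {
    std::sort(perf.begin(), perf.end(),
              [](const ConvolutionFwdAlgoPerf& a, const ConvolutionFwdAlgoPerf& b) {
                  return a.time_ms < b.time_ms;
              });
    for (const auto& p : perf)
        if (p.status == 0 && p.workspace_size <= max_workspace) return p.algo;
    return perf.empty() ? -1 : perf.front().algo;  // fallback: fastest overall
}
```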
It is to be understood that the above interfaces, functions, etc. are merely examples, and that different forms of instructions may be employed for different programs, and are not limited to the examples given above.
The present disclosure also provides an electronic device, including: one or more processors; and a memory having stored therein computer executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
According to one embodiment of the present disclosure, the static approach requires no adaptation by the user and is transparent to the user, which helps improve the customer experience.
According to another embodiment of the present disclosure, the dynamic approach requires a new interface and explicit adaptation by the user. The dynamic scheme is applicable to all scales and has a wide support range. The newly added separate search interface is not placed inside the computing interface; because the scale information of the network is generally obtained in advance in deep learning inference and training scenarios, the search interface can be called beforehand to select the optimal Kernel. This avoids frequent calls to the search interface during network training and actual deployment, thereby improving performance at the run stage.
Further, in an actual scenario, the user can select the required Kernel implementation according to the index the user cares most about, and is not limited to a single index. For example, if the user is more concerned with the additional memory requirement, the Kernel with the smallest storage requirement may be selected based on the value of that index.
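For instance, ranking by additional memory rather than run time could look like the following sketch, where the record layout and field names are illustrative assumptions.
```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative per-Kernel record; field names are assumptions.
struct KernelPerf {
    std::string id;
    double time_ms;            // run time
    std::size_t extra_memory;  // additional memory requirement
};

// Rank by additional memory instead of run time and return the Kernel with
// the smallest storage requirement. Assumes a non-empty candidate list.
std::string SelectByMemory(const std::vector<KernelPerf>& perfs) {
    return std::min_element(perfs.begin(), perfs.end(),
                            [](const KernelPerf& a, const KernelPerf& b) {
                                return a.extra_memory < b.extra_memory;
                            })
        ->id;
}
```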
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
The foregoing has described the embodiments of the present disclosure in detail, and specific examples have been used herein to explain the principles and implementations of the present disclosure; the description of the above examples is intended only to help in understanding the method of the present disclosure and its core ideas. Also, those skilled in the art, based on the teachings of the present disclosure, may make modifications or variations to the specific embodiments and application scope of the present disclosure, all of which fall within the scope of protection of the present disclosure. In view of the foregoing, this description should not be construed as limiting the disclosure.

Claims (27)

1. A method of scheduling Kernel in a neural network, comprising:
determining a current scale of the neural network;
selecting a preferred Kernel from a plurality of Kernels, wherein the preferred Kernel has matching performance on the current scale; and
running the preferred Kernel through a computing interface to implement the scheduling of Kernel.
2. The method of claim 1, further comprising:
a look-up table is established to facilitate selection, from a plurality of Kernels, of a preferred Kernel that has matching performance for the current scale, wherein the look-up table includes an existing scale identifier for identifying an existing scale of the neural network and a Kernel identifier for identifying a Kernel for the existing scale.
3. The method of claim 2, further comprising:
the Kernel identifiers in the lookup table are ordered according to Kernel's performance.
4. A method according to claim 2 or 3, wherein the existing scale comprises a single existing scale, a plurality of discrete existing scales or a continuous existing scale range.
5. The method of claim 4, wherein existing scales of different Kernel overlap.
6. The method of any of claims 2-5, wherein selecting a preferred Kernel from a plurality of kernels that has matching performance for the current scale comprises:
searching an optimal Kernel corresponding to the current scale in the lookup table;
and if the corresponding optimal Kernel is found, selecting the found optimal Kernel as the optimal Kernel with matching performance for the current scale.
7. The method of claim 6, further comprising:
if the corresponding optimal Kernel is not found, the preferred Kernel is selected according to a default rule.
8. The method of claim 1, wherein a preferred Kernel having matching performance for the current scale is selected from a plurality of Kernel through a different query interface than the computing interface.
9. The method of claim 8, wherein selecting a preferred Kernel from a plurality of Kernel having matching performance for the current scale comprises:
receiving estimated performance of each Kernel of the plurality of Kernel aiming at the existing scale;
the estimated performance of the plurality of Kernel is ranked so as to select a preferred Kernel having matching performance for the current scale.
10. The method of claim 9, wherein the existing scale comprises a single existing scale, a plurality of discrete existing scales, or a continuous existing scale range.
11. The method of claim 10, wherein existing scales of different Kernel overlap.
12. The method of claim 8, wherein selecting a preferred Kernel from a plurality of Kernel having matching performance for the current scale comprises:
transmitting a plurality of Kernel to the equipment end to run at the equipment end;
receiving the performance of Kernel returned from the equipment end; and
the performance of the plurality of Kernel is ranked so as to select a preferred Kernel having matching performance for the current scale.
13. The method of any of claims 1-12, wherein performance comprises at least one of:
run time of Kernel;
parallelism of Kernel;
memory requirements of Kernel.
14. A system for scheduling Kernel in a neural network, comprising:
a controller configured to determine a current scale of the neural network;
a query interface configured to select a preferred Kernel having matching performance for the current scale from a plurality of kernels; and
a computing interface configured to run the preferred Kernel through the computing interface, thereby implementing the scheduling of Kernel.
15. The system of claim 14, wherein the controller is further configured to:
a look-up table is established to facilitate selection, from a plurality of Kernels, of a preferred Kernel that has matching performance for the current scale, wherein the look-up table includes an existing scale identifier for identifying an existing scale of the neural network and a Kernel identifier for identifying a Kernel for the existing scale.
16. The system of claim 15, the controller further configured to:
the Kernel identifiers in the lookup table are ordered according to Kernel's performance.
17. The system of claim 15 or 16, wherein the existing scale comprises a single existing scale, a plurality of discrete existing scales, or a continuous existing scale range.
18. The system of claim 17, wherein existing scales of different kernels overlap.
19. The system of any of claims 15 to 18, wherein the query interface is configured to:
searching an optimal Kernel corresponding to the current scale in the lookup table;
and if the corresponding optimal Kernel is found, selecting the found optimal Kernel as the optimal Kernel with matching performance for the current scale.
20. The system of claim 19, wherein the query interface is further configured to:
if the corresponding optimal Kernel is not found, the preferred Kernel is selected according to a default rule.
21. The system of claim 14, wherein the query interface is configured to:
receiving estimated performance of each Kernel of the plurality of Kernel aiming at the existing scale;
the estimated performance of the plurality of Kernel is ranked so as to select a preferred Kernel having matching performance for the current scale.
22. The system of claim 21, wherein the existing scale comprises a single existing scale, a plurality of discrete existing scales, or a continuous existing scale range.
23. The system of claim 22, wherein existing scales of different Kernel overlap.
24. The system of claim 14, wherein the query interface is configured to:
transmitting a plurality of Kernel to the equipment end to run at the equipment end;
receiving the performance of Kernel returned from the equipment end; and
the performance of the plurality of Kernel is ranked so as to select a preferred Kernel having matching performance for the current scale.
25. The system of any of claims 14-24, wherein performance comprises at least one of:
run time of Kernel;
parallelism of Kernel;
memory requirements of Kernel.
26. An electronic device, comprising:
one or more processors; and
a memory having stored therein computer executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-13.
27. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method of any of claims 1-13.
CN202311769986.4A 2023-12-20 2023-12-20 Method for scheduling Kernel in neural network and related products Pending CN117762588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311769986.4A CN117762588A (en) 2023-12-20 2023-12-20 Method for scheduling Kernel in neural network and related products

Publications (1)

Publication Number Publication Date
CN117762588A true CN117762588A (en) 2024-03-26

Family

ID=90313798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311769986.4A Pending CN117762588A (en) 2023-12-20 2023-12-20 Method for scheduling Kernel in neural network and related products

Country Status (1)

Country Link
CN (1) CN117762588A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146069A (en) * 2017-06-16 2019-01-04 上海寒武纪信息科技有限公司 Arithmetic unit, operation method and chip
CN109726357A (en) * 2017-10-27 2019-05-07 阿里巴巴集团控股有限公司 Matrix multiplication calculation method and calculating equipment
CN116185378A (en) * 2021-11-29 2023-05-30 中科寒武纪科技股份有限公司 Optimization method of calculation graph, data processing method and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination