CN112016681A - Decomposition of machine learning operations - Google Patents

Decomposition of machine learning operations

Info

Publication number
CN112016681A
Authority
CN
China
Prior art keywords
operations
model
output
electronic device
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010374389.1A
Other languages
Chinese (zh)
Other versions
CN112016681B (en)
Inventor
C·凯特萨克里斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/601,507 (US11687789B2)
Application filed by Apple Inc
Publication of CN112016681A
Application granted
Publication of CN112016681B publication Critical patent/CN112016681B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Devices For Executing Special Programs (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to decomposition of machine learning operations. The subject technology receives a representation of a neural network (NN) model to be executed on an electronic device, the representation of the NN model including nodes corresponding to intermediate layers of the NN model. The subject technology determines, for a respective operation corresponding to each node in each respective intermediate layer of the NN model, a respective set of operations that is mathematically equivalent to the respective operation, such that an aggregation of outputs of the respective set of operations is equivalent to an output of the respective operation. The subject technology generates a graph based on each respective set of operations, where the graph includes a set of branches, each branch including a plurality of operations. The subject technology determines a respective order for executing each branch of the graph.

Description

Decomposition of machine learning operations
Cross Reference to Related Applications
This application claims the benefit of U.S. Provisional Patent Application Serial No. 62/855,850, entitled "DECOMPOSITION OF MACHINE LEARNING OPERATIONS" and filed on May 31, 2019, which is incorporated herein by reference in its entirety and forms part of the present U.S. utility patent application for all purposes.
Technical Field
The present specification relates generally to machine learning operations, including decomposing machine learning operations so that they execute more efficiently on a target platform.
Background
Software engineers and scientists have been using computer hardware for machine learning to achieve improvements across different industry applications, including image classification, video analysis, speech recognition, and natural language processing. Notably, neural networks are being utilized more frequently to create systems that can perform different computational tasks based on training from large amounts of data.
Drawings
Some of the features of the subject technology are set forth in the appended claims. However, for purposes of explanation, several embodiments of the subject technology are set forth in the following figures.
FIG. 1 illustrates an example network environment in accordance with one or more implementations.
Fig. 2 illustrates an example software architecture for performing a decomposition process for operation of a neural network, according to one or more implementations.
FIG. 3 illustrates an example data flow from various nodes in a portion of a neural network, in accordance with one or more implementations.
FIG. 4 illustrates an example data flow after a portion of the neural network depicted in FIG. 3 has undergone a decomposition process, according to one or more implementations.
FIG. 5 illustrates an example of a first neural network and a second neural network that have undergone a decomposition process in accordance with one or more implementations.
Fig. 6 illustrates a flow diagram of an example process for performing a decomposition process for a neural network, in accordance with one or more implementations.
FIG. 7 illustrates an electronic system with which one or more implementations of the subject technology may be implemented.
Detailed Description
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The accompanying drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. The subject technology is not limited to the specific details set forth herein, however, and may be practiced with one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
The popularity of machine learning has risen dramatically in recent years due to the availability of large amounts of training data and the advancement of more powerful and efficient computing hardware. One popular machine learning technique is to utilize a deep neural network to perform a set of machine learning tasks. A common approach for training a deep neural network is to utilize a graphics processing unit (GPU), and GPUs are also commonly used to execute the trained deep neural network on new input data. However, in some instances, when a given deep neural network is executed, different operations of the deep neural network may require memory accesses to (e.g., writing to and/or reading from) slower memory (e.g., off-chip memory) because the output of an operation is too large to be stored in faster memory, such as an on-chip cache (e.g., L1, L2) on the target device executing the network. For example, the output of a node in a deep neural network may provide data that is too large to be stored in an on-chip cache of the target device executing the deep neural network, and such data is instead stored in slower memory such as DRAM. Consequently, the deep neural network may execute more slowly as data is read from and written to the DRAM.
Implementations of the subject technology described herein reduce memory traffic per operation of a neural network by performing a decomposition process that splits a given operation of the neural network into various operations having outputs that can fit within a cache of a target device executing the network. Thus, the performance of the neural network may be improved by avoiding access to slower memory (e.g., DRAM), which may be necessary when the decomposition process is not performed. Advantageously, the accuracy of the network is not affected by the decomposition process described herein. Thus, these benefits are understood to improve the computing functionality of a given electronic device, such as an end-user device, which may typically have fewer available computing resources than, for example, one or more cloud-based servers.
FIG. 1 illustrates an example network environment 100 in accordance with one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
Network environment 100 includes electronic device 110, electronic device 115, and server 120. The network 106 may communicatively (directly or indirectly) couple the electronic device 110 and/or the server 120, the electronic device 115 and/or the server 120, and/or the electronic device 110 and/or the electronic device 115. In one or more implementations, the network 106 may be an interconnected network that may include the Internet or devices communicatively coupled to the Internet. For purposes of explanation, network environment 100 is shown in FIG. 1 as including electronic device 110, electronic device 115, and server 120; however, network environment 100 may include any number of electronic devices and any number of servers.
The electronic device 110 may be, for example, a desktop computer, a portable computing device such as a laptop computer, a smartphone, a peripheral (e.g., digital camera, headset), a tablet device, a wearable device such as a watch, a band, and so forth. In FIG. 1, by way of example, the electronic device 110 is depicted as a desktop computer. The electronic device 110 may be and/or may include all or part of an electronic system discussed below with respect to fig. 7.
In one or more implementations, the electronic device 110 may provide a system for splitting operations from a neural network model into code of a particular programming language (e.g., C code, C++ code, Swift code). In particular, the subject system may include a neural network compiler for compiling the code. In an example, using the compiled code, the subject system can create an executable software package for deployment on a target platform, such as the electronic device 115, with the assistance of the server 120. When the compiled code is executed, the target platform may perform one or more given operations of the neural network model.
The electronic device 115 may be, for example, a portable computing device such as a laptop computer, a smartphone, a peripheral device (e.g., a digital camera or headset), a tablet device, a wearable device such as a watch or a band, or any other appropriate electronic device. The electronic device 115 may also include processors having different computing capabilities, including, for example, a CPU, a GPU, and/or a neural processor. In FIG. 1, by way of example, the electronic device 115 is depicted as a smartphone. In one or more implementations, the electronic device 115 may be and/or may include all or a portion of the electronic system discussed below with respect to FIG. 7.
In one or more implementations, the server 120 deploys the compiled code included in an executable software package to a target device for execution. In an example, the electronic device 115 may be a target device that receives the software package with the compiled neural network code and executes the compiled code in a runtime environment of the electronic device 115. The electronic device 115 (or any electronic device that is a target device) includes a framework that is enabled to perform the operations in the compiled code of the neural network. A framework can refer to a software environment that provides particular functionality as part of a larger software platform to facilitate the development of software applications.
Fig. 2 illustrates an example software architecture for performing a decomposition process for operation of a neural network, according to one or more implementations. For purposes of illustration, the software architecture is described as being provided by the electronic device 110 of fig. 1, such as by a processor and/or memory of the electronic device 110; however, the software architecture may be implemented by any other electronic device. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
As shown, the computing architecture includes a neural network compiler 215. The memory 240 includes Neural Network (NN) model source code 244 that, after being compiled by the neural network compiler 215, generates a Neural Network (NN) binary executable 242 that may be deployed to different target platforms for execution. In an example, the NN model source code 244 may include code for various algorithms that may be utilized separately or in combination to implement particular functions for execution on a given target device. As described above, the target device may include various hardware sensors and different processors (e.g., as provided by the electronic device 115) that may be utilized when running the NN binary executable 242 on the target device. In an example, the particular functionality may include image processing or computer vision related functionality, speech recognition, natural language processing, and so forth.
Although the neural network compiler 215 is disposed on the electronic device 110 in the example of fig. 2, in some implementations, such a compiler may be disposed on a particular electronic device (e.g., the electronic device 115) that locally compiles the source code and executes the compiled code on the same device. In particular implementations, the NN model source code 244 may be compiled for a particular target platform and then deployed to a different device (such as the electronic device 115) for execution. In an example, the NN model source code 244 may include at least code corresponding to a set of operations to be performed by corresponding nodes from each layer of a given NN model. In an example, the code of an operation in a layer of the NN is a respective function call for performing the operation and/or a set of parameters for the function call. Additionally, code corresponding to one or more input and output characteristics, data structures, and characteristic types may be included in the NN model source code 244.
As further shown, the neural network compiler 215 includes an operation decomposition engine 230 that performs a decomposition process on the NN model source code 244 to split respective operations of the nodes of the NN model into various decomposed operations (e.g., operations that produce outputs with a reduced data size that fits within a cache). The operation decomposition engine 230 also serves as a scheduling component to determine the order in which to perform the decomposed operations. Such a decomposition process, as described herein, refers to the following: for a given operation at a node, operations (e.g., split operations) are generated that produce outputs of reduced size, each operation providing a particular output of a particular size, such that the size of the output enables the output to be stored within a cache (such as an on-chip cache) of a target device (e.g., electronic device 115). In an example, the size of a cache (such as L2 cache 252 on electronic device 115) is determined based on the underlying hardware architecture of the given electronic device. Thus, the respective output size of each split operation is constrained by the size of the cache of the electronic device (e.g., the size of the L2 cache 252 on the electronic device 115). In addition, the operation decomposition engine 230 performs the decomposition process to ensure that, for a given node, the aggregation of the outputs of the split operations is equal to the output of the operation for that node prior to the split. In a specific implementation, the output of the foregoing operations may be in the form of a data structure, such as a container (e.g., a tensor) that may store data in N dimensions (e.g., a matrix, a vector, an array of arrays, etc.).
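As an illustration of the constraint just described, the following sketch (an illustrative assumption, not the patent's implementation; the names Op, split_heightwise, and cache_bytes are hypothetical) splits a single operation's output along its height until each piece fits a given cache budget, so that the aggregation of the pieces equals the original output:

    from dataclasses import dataclass

    @dataclass
    class Op:
        name: str
        out_shape: tuple          # (channels, height, width) of the output
        bytes_per_elem: int = 2   # e.g., 16-bit elements

    def split_heightwise(op: Op, cache_bytes: int) -> list:
        """Split op along the output height until each piece fits the cache budget."""
        c, h, w = op.out_shape
        parts = 1
        while c * -(-h // parts) * w * op.bytes_per_elem > cache_bytes:
            parts += 1                              # try more height-wise pieces
        pieces, start = [], 0
        for i in range(parts):
            rows = -(-(h - start) // (parts - i))   # spread the remaining rows evenly
            pieces.append(Op(f"{op.name}[{i}]", (c, rows, w), op.bytes_per_elem))
            start += rows
        return pieces

    # Example: a 64 x 100 x 200 output of 16-bit elements (~2.4 MB) against a 2 MB
    # cache budget is split into two height-wise pieces of ~1.2 MB each.
    pieces = split_heightwise(Op("conv", (64, 100, 200)), cache_bytes=2 * 1024 * 1024)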
The operation decomposition engine 230 may obtain source code from the NN model source code 244 and perform decomposition of operations corresponding to the nodes of the NN model represented in the NN model source code 244. In an example, code corresponding to the split operation may be included in the source code. The neural network compiler 215 takes the source code from the operation decomposition engine 230 and compiles the code into an NN binary executable for the target device, which may be stored in the neural network binary executable 242 and then deployed to the target device (e.g., the electronic device 115) for execution.
Although the neural network compiler 215 is disposed on the electronic device 110 in the example of fig. 2, in some implementations, such a compiler may be disposed on a particular electronic device that compiles code for the neural network model and executes the compiled neural network model on the same device.
As described above, the neural network model may be compiled from the NN model source code 244 for a particular target platform and then deployed to a different device (such as the electronic device 115) for execution. As further shown, in a particular implementation, the electronic device 115 includes a system on a chip (SoC) 250. The SoC 250 includes an L2 cache 252, a CPU 254, a GPU 255, and a neural processor 256. The electronic device 115 also includes DRAM 258, which is a slower-access memory than the L2 cache 252. Accessing the DRAM 258 may consume computing resources of the electronic device 115 because it requires a large amount of power, and it may affect the performance of the NN model by slowing down the memory-bound layers (e.g., pooling layers, element-wise layers, etc.) of the NN. In contrast, in an implementation, the L2 cache 252 is very fast but is significantly smaller in size than the DRAM 258. Consequently, many of the outputs of the operations of the NN model typically will not fit into the L2 cache 252. For purposes of illustration, the on-chip cache is depicted in FIG. 2 as the L2 cache 252; however, the on-chip cache may be any level of cache, such as L1, L2, L3, L4, and so forth. As shown, in a particular implementation, the L2 cache 252 is included as part of the neural processor 256 and, consequently, other processors on the SoC 250 cannot access the L2 cache 252.
More recently, specialized (e.g., dedicated) hardware optimized for performing specific operations of a given NN has been developed. A given electronic device may include a neural processor 256, which may be implemented as circuitry that performs various machine learning operations based on computations including multiplication, addition, and accumulation. Such computations may be arranged to perform, for example, a convolution of input data. In an example, the neural processor is specifically configured to execute machine learning algorithms, typically by operating on predictive models such as NNs. In one or more implementations, the electronic device may include a neural processor 256 in addition to the CPU 254 and/or the GPU 255.
As discussed herein, CPU 254 may refer to a main processor in a given electronic device that performs the operations of basic arithmetic, logic, control, and input/output operations specified by instructions of a computer program or application, including some operations for neural network models. As discussed herein, GPU 255 may refer to a special-purpose electronic circuit designed to perform operations for rendering graphics that, in many instances, is also utilized to process computational workloads (e.g., specified by instructions of a computer program or application) for machine learning operations. The CPU 254, GPU 255, and neural processor 256 may each have different computational specifications and capabilities, depending on their respective implementations, where each of the foregoing components may provide different levels of performance for certain operations than others.
As discussed herein, a convolutional neural network refers to a particular type of neural network that uses different types of layers made up of nodes arranged in three dimensions, where the dimensions may vary between layers. In a convolutional neural network, nodes in a layer may only be connected to a subset of the nodes in a previous layer. The final output layer may be fully connected and may be sized according to the number of classifiers. In some instances, a convolutional neural network model may include various combinations and orderings of the following types of layers, with multiples of each: an input layer, convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and fully connected layers. A portion of the operations performed by the convolutional neural network includes obtaining a set of filters (or kernels) that iterate over the input data based on one or more parameters. In an example, the depth of a convolutional layer may be equal to the number of filters used. It should be understood that, given the hyperparameters of the convolutional neural network, the size of the different volumes at each layer can be determined mathematically.
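Where the preceding paragraph notes that the size of the different volumes can be determined mathematically, the following is a standard worked example (a generic formula, not specific to the patent; the function name and the stride/padding defaults are illustrative assumptions):

    def conv_output_volume(in_h, in_w, num_filters, kernel, stride=1, padding=0):
        """Return the (channels, height, width) of a convolutional layer's output.

        Uses the usual formula out = (in - kernel + 2 * padding) // stride + 1;
        the output depth equals the number of filters used.
        """
        out_h = (in_h - kernel + 2 * padding) // stride + 1
        out_w = (in_w - kernel + 2 * padding) // stride + 1
        return num_filters, out_h, out_w

    # E.g., a 3 x 3 convolution with 64 filters, stride 1, and no padding applied to
    # a 96 x 102 x 202 input yields a 64 x 100 x 200 output.
    assert conv_output_volume(102, 202, 64, kernel=3) == (64, 100, 200)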
Because of the volume of data processed, convolutional neural networks typically run on cloud-based computing platforms. In such instances, memory management is often an afterthought because cloud-based systems have few practical memory constraints (e.g., greater computing power and memory are readily available). In contrast, it may not be possible or practical to store all of the weights and resulting node values of a convolutional neural network in memory on a memory-constrained device (e.g., a mobile electronic device such as a smartphone).
Fig. 3 illustrates example data flows from various nodes in a portion of a neural network 300, according to one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
As shown, the portion of the neural network 300 includes data 302, data 304, and data 306. This portion of the neural network 300 includes a pooling layer 303 and a convolutional layer 305. In the example of fig. 3, the portion of the neural network 300 has not yet undergone the decomposition process by the operation decomposition engine 230. At time t0, data 302 corresponds to an output of size 4 Megabytes (MB). At time t1, data 304 corresponds to an output of size 4 Megabytes (MB). At time t2, data 306 corresponds to an output of size 2.4 Megabytes (MB). In fig. 3, 8MB of data is transferred between the pooling layer 303 and the convolutional layer 305 without utilizing a cache. In an example, the pooling layer 303 performs a pooling operation on the data 302, which is received as input data by the pooling layer 303. The pooling operation may include a maximum pooling operation (e.g., reporting the maximum output within a rectangular neighborhood), an average of a rectangular neighborhood, a Euclidean norm of a rectangular neighborhood, or a weighted average based on distance from a center pixel. Convolutional layer 305 performs a convolution operation on data 304 and provides data 306 as output. In an example, the convolution operation may include performing an affine transformation and/or filtering.
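Taking the tensor dimensions implied by the worked example below (96 x 104 x 204 for T0, 96 x 102 x 202 for T1, and 64 x 100 x 200 for T2) and assuming 16-bit elements (an assumption; the element type is not stated in the text), the quoted sizes can be checked with a short calculation:

    def tensor_mib(channels, height, width, bytes_per_elem=2):
        """Tensor footprint in MiB, assuming 16-bit elements by default."""
        return channels * height * width * bytes_per_elem / 2**20

    print(round(tensor_mib(96, 104, 204), 1))   # ~3.9 MiB -> data 302 (T0), quoted as ~4 MB
    print(round(tensor_mib(96, 102, 202), 1))   # ~3.8 MiB -> data 304 (T1), quoted as ~4 MB
    print(round(tensor_mib(64, 100, 200), 1))   # ~2.4 MiB -> data 306 (T2), quoted as 2.4 MB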
The following discussion describes an example of the decomposition process, i.e., how such a mechanism can be constructed.
The following steps are provided for a simple convolution with optional padding. In this example, a generalization may account for other parameters (e.g., stride):
1) A "logical region" is defined as a coordinate range plus padding (if any) in all directions (top, right, bottom, left).
2) A function is constructed that identifies the logical regions of the inputs needed to produce the logical regions of the outputs.
3) The function is applied recursively to all involved regions in a bottom-up fashion (from the output of the last operator toward the input of the first operator).
For the example of fig. 3, the following process may be performed:
Pick a splitting factor of height 50 and width 200.
This defines two regions: (0,0,0)..(63,49,199) and (0,50,0)..(63,99,199).
Here the regions are given 0-based in (channel, height, width) form.
For the last operator (convolution 3 × 3):
- region (0,0,0)..(63,49,199) of 306 (T2) requires input (0,0,0)..(95,51,201) from 304 (T1)
- region (0,50,0)..(63,99,199) of 306 (T2) requires input (0,50,0)..(95,101,201) from 304 (T1)
For the first operator (pooling 3 × 3):
- region (0,0,0)..(95,51,201) of 304 (T1) requires input (0,0,0)..(95,53,203) from 302 (T0)
- region (0,50,0)..(95,101,201) of 304 (T1) requires input (0,50,0)..(95,103,203) from 302 (T0)
The intermediate/temporary tensors (i.e., T1(a) and T1(b)) will have the dimensions of their corresponding regions:
404 (T1(a)), logically the region (0,0,0)..(95,51,201) of 304 (T1), is 96 × 52 × 202
405 (T1(b)), logically the region (0,50,0)..(95,101,201) of 304 (T1), is 96 × 52 × 202
In the above example, the presence of an input stride expands the required input region, while the presence of padding means that some regions may include padding while others do not.
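The three steps above can be sketched as follows, reproducing the regions of the worked example; the function name and signature are illustrative assumptions, not the patent's implementation:

    def required_input_region(out_region, kernel_hw, stride_hw=(1, 1),
                              pad_tl=(0, 0), in_channels=None):
        """Map a logical output region of a 2-D convolution or pooling operator to the
        logical input region needed to produce it (step 2 above).

        Regions are 0-based, inclusive (channel, height, width) ranges. pad_tl gives
        (pad_top, pad_left); with padding, border regions need less real input, which
        the clamping below reflects. in_channels is the total input channel count for
        a convolution; None means a per-channel operator (e.g., pooling).
        """
        (c0, h0, w0), (c1, h1, w1) = out_region
        kh, kw = kernel_hw
        sh, sw = stride_hw
        ph, pw = pad_tl
        # Output row r reads input rows r*sh - ph .. r*sh - ph + kh - 1 (same for width).
        in_h0, in_h1 = h0 * sh - ph, h1 * sh - ph + kh - 1
        in_w0, in_w1 = w0 * sw - pw, w1 * sw - pw + kw - 1
        # A convolution reads every input channel; pooling keeps the channel range.
        in_c0, in_c1 = (0, in_channels - 1) if in_channels is not None else (c0, c1)
        return (in_c0, max(in_h0, 0), max(in_w0, 0)), (in_c1, in_h1, in_w1)

    # Applied bottom-up (step 3): last operator first (convolution 3x3 over 96 input
    # channels), then the first operator (pooling 3x3, per-channel).
    t1_region = required_input_region(((0, 0, 0), (63, 49, 199)), (3, 3), in_channels=96)
    # -> ((0, 0, 0), (95, 51, 201)), the region of 304 (T1) above.
    t0_region = required_input_region(t1_region, (3, 3))
    # -> ((0, 0, 0), (95, 53, 203)), the region of 302 (T0) above.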
Fig. 4 illustrates an example data flow 400 after a portion of the neural network 300 depicted in fig. 3 has undergone a decomposition process, according to one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided. Fig. 4 will be discussed with reference to the components of fig. 3.
In the example of fig. 4, the operation decomposition engine 230 has performed a decomposition process on the pooling layer 303 and the convolutional layer 305 of the portion of the neural network 300 in fig. 3. For example, pooling layer 303 has been split into a pooling layer 410 and a pooling layer 420. In this way, pooling layers 410 and 420 have been generated as a result of the decomposition process performed by the operation decomposition engine 230. In addition, convolutional layer 305 has been split into convolutional layer 412 and convolutional layer 422, which are the results of the decomposition process performed by operational decomposition engine 230.
As further shown, data 402 is provided as input to a pooling layer 410 that performs pooling operations. Pooling layer 410 provides data 404 as output, which is received as input data to convolutional layer 412. Convolutional layer 412 performs a convolution operation on data 404 and provides data 406 as output.
As further shown, the data 403 is provided as input to a pooling layer 420 that performs pooling operations. Pooling layer 420 provides data 405 as output, which is received as input data to convolutional layer 422. Convolutional layer 422 performs a convolution operation on data 405 and provides data 407 as output.
As shown in fig. 4, the shaded regions corresponding to the data 406 and the data 407 are each (64 × 50 × 100). In this example, data 406 and data 407 correspond to data 306 (T2) in fig. 3. The shaded regions are projected onto the data 306 (T2) because the results are stored into the original T2 corresponding to data 306. Thus, in this example, T2 itself is not decomposed; only the computation is.
Further, in this example, T0 (e.g., corresponding to data 302) is not broken down into smaller objects; instead, the regions needed for the computation are read directly from T0. As shown, the shaded regions in the data 402 and the data 403 are each (96 × 54 × 104), and the data 402 and the data 403 collectively include the data 302 (T0). In this example, the two regions overlap each other.
In the example of fig. 4, each object generated has a height and width that are 2 elements less than those of its input. The type of layer, the kernel dimensions, the stride, and the padding determine how much input is needed per 1 x 1 output. Although in this example the input amount is the same for both the convolutional and pooling layers, this is shown as such for simplicity only.
In the example of fig. 4, data 404 and data 405 have common elements (two middle rows). This is effectively a redundant computation and redundant read for T0. In this example, convolutional layer 412 and convolutional layer 422 use the same coefficients for convolution, which are also read from DRAM.
Fig. 5 illustrates an example of a neural network 500 and a neural network 550 that have undergone a decomposition process in accordance with one or more implementations.
A neural network (NN) may be represented as a directed graph forming a single path (e.g., a producer/consumer chain) that includes 1) respective nodes representing the operations of the layers of the NN, and 2) other nodes representing the respective output of each of the operation nodes. In such a graph, a node representing the output of one node (corresponding to an operation) will be consumed by a subsequent node corresponding to a different operation in the graph. The following example illustrates this concept:
[node 1: operation 1] -> [node 2: output of operation 1] -> [node 3: operation 2] -> [node 4: output of operation 2], etc.
In the above example, memory (e.g., cache or DRAM) is required to store the outputs corresponding to node 2 and node 4. The operation decomposition engine 230 of the NN compiler 215 determines how to split operations in each layer of the NN 500 into multiple branches, where each branch includes multiple operations, and where objects generated by the operations in each branch will fit into a cache (e.g., the L2 cache 252) on the target device. In an example, the aggregation of the outputs of these branches is mathematically equivalent to the corresponding operations in the layer of the NN (e.g., the original NN before being split into branches). Furthermore, each branch is fully executed before switching to execute another branch.
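The branch structure and execution order just described can be sketched as follows; the classes and the run callables are illustrative assumptions rather than the compiler's actual data structures:

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class BranchOp:
        name: str
        run: Callable     # consumes the previous output and returns the next one
        out_bytes: int    # size of this op's output; intended to fit the cache budget

    @dataclass
    class Branch:
        ops: List[BranchOp] = field(default_factory=list)

        def execute(self, branch_input):
            x = branch_input
            for op in self.ops:
                x = op.run(x)   # intermediate stays cache-resident in the ideal case
            return x            # written into this branch's region of the final output

    def execute_graph(branches: List[Branch], branch_inputs: list) -> list:
        # Each branch is fully executed before switching to another branch.
        return [branch.execute(x) for branch, x in zip(branches, branch_inputs)]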
In a particular implementation, the operation decomposition engine 230 determines 1) the size of the objects corresponding to the input or output data and 2) the order of the objects to enable such objects to be stored in a cache that is accessible directly from the SoC 250. Since caches (e.g., L2 cache 252) are used to store such objects where possible, memory traffic to DRAMs (e.g., DRAM 258) is reduced.
The operation decomposition engine 230 determines a trade-off between: 1) generating more branches, where each branch introduces additional computational overhead for redundant operations; or 2) performing the operation without splitting it into multiple operations, which results in slower memory accesses because the data cannot be cached and instead resides in DRAM. Temporary objects are determined by the neural network compiler 215 to ensure that such objects are short-lived (e.g., generated and consumed within a short amount of time, and subsequently discarded to free up cache memory).
In an example, the operation decomposition engine 230 determines a first set of operations for the respective operations and determines a second set of operations for the respective operations. The operation decomposition engine 230 selects one of the first set of operations or the second set of operations based at least in part on an analysis of the first set of operations and the second set of operations. In an example, the analysis may indicate which of the first and second sets of operations utilizes fewer resources (e.g., memory) to facilitate selection by the operation decomposition engine 230.
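A minimal sketch of this selection step, under the assumption that the analysis amounts to comparing an estimated resource cost (e.g., bytes of memory traffic) per candidate set; cost is a caller-supplied estimator, not a defined part of the compiler:

    def choose_operation_set(first_set, second_set, cost):
        """Return whichever candidate decomposition the cost analysis favors."""
        return first_set if cost(first_set) <= cost(second_set) else second_set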
In an example, a CNN operator may shrink, keep the same, or even expand the dimensions (all three dimensions) of its output. This can adversely affect the number of operators involved in splitting the output, and the resulting "chains." The more disparate the input/output dimensions, the less efficient the L2 cache space utilization and the greater the number of "branches." The branch count affects the overall mechanism, as each branch results in additional coefficient re-reads and extra computation. This is one reason that smaller "chains" should also be examined.
In an example, the operation decomposition engine 230 first finds a chain of operations that occurs in the producer/consumer pattern shown in fig. 3: T0 -> T1 -> ... -> T_n.
Such a chain may include n >= 2 operators, n + 1 tensors, and n - 1 intermediate tensors.
The operation decomposition engine 230 then checks whether any intermediate tensor T_i (0 < i < n) is not guaranteed an allocation in L2. Given an intermediate tensor T_i, if the combined size of (T_(i-1) and T_i) or of (T_i and T_(i+1)) exceeds the L2 size, then T_i (along with T_(i-1) and T_(i+1)) is not guaranteed to fit in L2.
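This check can be transcribed directly as a sketch (tensor sizes given in bytes along the chain T_0 .. T_n; l2_bytes is the cache capacity of the target device):

    def unguaranteed_intermediates(tensor_bytes, l2_bytes):
        """Return the indices i of intermediate tensors T_i not guaranteed to fit in L2."""
        n = len(tensor_bytes) - 1
        flagged = []
        for i in range(1, n):   # intermediate tensors T_1 .. T_(n-1)
            if (tensor_bytes[i - 1] + tensor_bytes[i] > l2_bytes or
                    tensor_bytes[i] + tensor_bytes[i + 1] > l2_bytes):
                flagged.append(i)
        return flagged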
As previously described, there may be redundancy in the memory accesses for T0 shown in the previous figures. This modification achieves savings by keeping the chain's intermediates in L2 through decomposition, even when T0 ends up in DRAM. Thus, the original chain may be trimmed in order to operate on its segments, i.e., in the determined case, T0 -> ... -> T8 is replaced with T0 -> ... -> T2 and T5 -> ... -> T8.
As shown in the example of fig. 5, the neural network 500 includes intermediate layers 505, 515, and 525. The intermediate layers 505, 515, and 525 may be different types of layers, such as convolutional layers, ReLU layers, pooling layers, fully-connected layers, etc., that perform respective operations corresponding to the types of layers. Thus, the aforementioned intermediate layers may have different dimensions. As further shown, input data 501 and output data 560 are stored in DRAM. Prior to the decomposition process, the neural network 500 stores the data 510, the data 520, and the data 530 in the DRAM because the respective sizes of the aforementioned data do not fit within the cache memory.
The neural network 500 also shows the dependencies between the different intermediate layers. Thus, the middle tier 515 uses the output of the middle tier 505 (e.g., data 510), and the middle tier 525 uses the output of the middle tier 515.
The operation decomposition engine 230 performs a decomposition process on the intermediate layer 505 and splits operation O1 into three operations O2, O3, and O4, corresponding to intermediate layer 506, intermediate layer 507, and intermediate layer 508, respectively. In the example shown, operation O1 corresponds to a pooling layer operation, and operations O2, O3, and O4 are each respective pooling layer operations with various hyperparameters (e.g., spatial extent and/or stride) that affect the size of the output of the corresponding pooling layer operation.
Given a chain, the operation decomposition engine 230 determines the DRAM traffic involved. In a specific implementation, the DRAM traffic involved is due to: 1) tensors that are not guaranteed to reside in L2; and 2) certain operations that have kernel coefficients (mostly convolutional layers).
In an example, when the output (T_n) is split into multiple parts (2 parts at first, then 3 parts, 4 parts, etc.), all intermediate tensor sizes and all regions involved are calculated until a "split factor" is determined that guarantees that DRAM traffic will drop.
In particular, for convolutional layers, kernel coefficients may have to be re-read from DRAM, and this must be considered in the implementation.
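A hedged sketch of this split-factor search follows. The traffic model is an illustrative assumption (unguaranteed intermediates are charged once, kernel coefficients are re-read once per part, and halo overlap between parts is ignored), not the compiler's actual cost model:

    def dram_traffic(tensor_bytes, coeff_bytes, l2_bytes, parts):
        """Rough DRAM-traffic estimate for a chain split into `parts` branches."""
        traffic = coeff_bytes * parts                   # coefficient re-read per branch
        per_part = [b / parts for b in tensor_bytes]    # approximate split tensor sizes
        n = len(per_part) - 1
        for i in range(1, n):                           # intermediates T_1 .. T_(n-1)
            if (per_part[i - 1] + per_part[i] > l2_bytes or
                    per_part[i] + per_part[i + 1] > l2_bytes):
                traffic += tensor_bytes[i]              # this intermediate spills to DRAM
        return traffic

    def find_split_factor(tensor_bytes, coeff_bytes, l2_bytes, max_parts=16):
        """Return the first split factor whose estimated DRAM traffic beats no split."""
        baseline = dram_traffic(tensor_bytes, coeff_bytes, l2_bytes, parts=1)
        for parts in range(2, max_parts + 1):
            if dram_traffic(tensor_bytes, coeff_bytes, l2_bytes, parts) < baseline:
                return parts
        return None   # no split factor within the bound reduces the estimate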
Next, the operation decomposition engine 230 performs a decomposition process on the intermediate layer 515 and splits operation O5 into three operations O6, O7, and O8, corresponding to intermediate layer 516, intermediate layer 517, and intermediate layer 518, respectively.
In addition, the operation decomposition engine 230 performs a decomposition process on the intermediate layer 525 and splits operation O9 into three operations O10, O11, and O12, corresponding to intermediate layer 526, intermediate layer 527, and intermediate layer 528, respectively.
In this example, the operation decomposition engine 230 may group the decomposition operations into different execution branches for the network. For example, branch 570 includes intermediate layer 506, intermediate layer 516, and intermediate layer 526. In addition, branch 572 includes intermediate layer 507, intermediate layer 517, and intermediate layer 527. In addition, branch 574 includes intermediate layer 508, intermediate layer 518, and intermediate layer 528.
To provide input data to the middle tier of the initial set, the operation decomposition engine 230 performs a decomposition process on the data 501. As shown, when the network is executed on the target device, data 501 is split into data 502, data 503, and data 504, which are provided as input data to intermediate layer 506, intermediate layer 507, and intermediate layer 508, respectively.
The following discussion describes data flows throughout the network. Each of the intermediate layers 506, 507, and 508 performs a respective operation and generates as output data 511, 512, and 513, respectively. As shown, data 511, data 512, and data 513 are provided to intermediate layer 516, intermediate layer 517, and intermediate layer 518, respectively. Each of the middle layers 516, 517, and 518 performs a respective operation and generates as output data 521, 522, and 523, respectively. Further, as shown, data 521, data 522, and data 523 are provided to intermediate layer 526, intermediate layer 527, and intermediate layer 528, respectively. Each of intermediate layer 526, intermediate layer 527, and intermediate layer 528 performs a corresponding operation and generates data 531, data 532, and data 533 as outputs, respectively.
For each branch, the decomposition process performed by the operation decomposition engine 230 has split the original middle layer into multiple operations, with the respective middle layer from each branch providing output data that can fit into the cache as shown in fig. 5, thereby minimizing utilization of memory bandwidth on the target device (e.g., by forgoing memory accesses to slower DRAMs).
In this example, the operation decomposition engine 230 determines the order in which each of the operations of the middle tier corresponding to the aforementioned branches are performed. For example, operation decomposition engine 230 may determine the order in which branch 570, branch 572, and branch 574 are executed. In particular implementations, each branch is fully executed before another branch is selected for execution.
As further shown, output layer 540 receives data 531, data 532, and data 533. The aggregation of data 531, data 532, and data 533 is equivalent to data 530 in the neural network 500, which ensures that the accuracy of the neural network 550 is not affected by the decomposition process. In this example, the output layer 540 in the neural network 500 and in the neural network 550 performs the same operations to provide output data 560 (which is equivalent in both networks in fig. 5). In an example, data 531, data 532, and data 533 may be aggregated by combining each of the foregoing data together or by using a deep join technique to provide output data 560.
In implementations, there is no "aggregation" per se (e.g., no additional move/copy). For example, when splitting the output, the computation is split into multiple regions (e.g., the shaded region of T2 in fig. 4); the result is written directly into the final buffer (T2).
In a specific implementation, these logical regions have coordinates. For example, data 407 corresponds to the T2 region starting from (50,0) and extending through (99,199), using 0-based indexing and a (height, width) orientation. In addition, when convolutional layer 422 produces results, these results are written directly into T2.
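A small illustration of this point, using NumPy arrays as an assumed stand-in for the runtime's buffers: each branch writes its result directly into its logical region of the final T2 buffer, so no separate aggregation or copy step is needed.

    import numpy as np

    t2 = np.empty((64, 100, 200), dtype=np.float16)    # final output buffer (T2); 16-bit elements assumed

    def write_branch_result(result, height_range):
        h0, h1 = height_range              # inclusive, 0-based row range of the region
        t2[:, h0:h1 + 1, :] = result       # write directly into the final buffer

    # The branch producing data 406 covers rows 0..49 of T2, and the branch producing
    # data 407 covers rows 50..99 (i.e., the region (50,0)..(99,199) noted above).
    write_branch_result(np.zeros((64, 50, 200), dtype=np.float16), (0, 49))
    write_branch_result(np.ones((64, 50, 200), dtype=np.float16), (50, 99))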
In particular implementations, the decomposition operations of the neural network 550 (e.g., the additional branches of layers corresponding to the operations performed for each branch, including the order in which each of these branches is executed) may be included with the code for the network that is compiled into a binary executable file.
Fig. 6 illustrates a flow diagram of an example process 600 for performing a decomposition process for a neural network, according to one or more implementations. For purposes of illustration, process 600 is described herein primarily with reference to components of the software architecture of fig. 2, which may be executed by one or more processors of electronic device 110 of fig. 1. However, process 600 is not limited to electronic device 110, and one or more blocks (or operations) of process 600 may be performed by one or more other components of other suitable devices, such as by electronic device 115. Further for purposes of illustration, the blocks of process 600 are described herein as occurring sequentially or linearly. However, multiple blocks of process 600 may occur in parallel. Further, the blocks of the process 600 need not be performed in the order shown, and/or one or more blocks of the process 600 need not be performed and/or may be replaced by other operations.
The operation decomposition engine 230 receives a representation of a Neural Network (NN) model to be executed on an electronic device (610). In an example, the representation of the NN model includes nodes corresponding to intermediate layers of the NN model, where at least some of the nodes each correspond to a respective operation of a respective intermediate layer of the NN model to be performed by the electronic device.
The operation decomposition engine 230 determines, for a respective operation corresponding to each node in each respective middle layer of the NN model, a respective set of operations mathematically equivalent to the respective operation, such that an aggregation of outputs of the respective set of operations is equivalent to an output of the respective operation (612).
The operation decomposition engine 230 generates a graph based on each respective set of operations, wherein the graph includes a set of branches, each branch including a plurality of operations including a particular operation from each respective set of operations (614).
The operation decomposition engine 230 determines the corresponding order for executing each branch of the graph (616).
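A compact, self-contained sketch of blocks 610-616 using simplified stand-in types; the names and the common height-wise split factor are assumptions for illustration, not the compiler's actual interface:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class NodeOp:
        layer: int
        out_shape: tuple                  # (channels, height, width) of the node's output

    def decompose_model(nodes: List[NodeOp], cache_bytes: int, bytes_per_elem: int = 2):
        # Block 612: pick a common split factor so that each node's split operations
        # produce outputs that fit the cache budget (a mathematically equivalent set
        # whose aggregated outputs equal the original output).
        parts = 1
        for node in nodes:
            c, h, w = node.out_shape
            while c * -(-h // parts) * w * bytes_per_elem > cache_bytes:
                parts += 1
        # Block 614: branch b holds the b-th piece of every layer's split operation.
        graph = [[(node.layer, b) for node in nodes] for b in range(parts)]
        # Block 616: a simple order -- execute branch 0 fully, then branch 1, and so on.
        order = list(range(parts))
        return graph, order

    # Example: a three-layer chain and a 2 MB cache budget yields two branches.
    graph, order = decompose_model(
        [NodeOp(0, (96, 102, 202)), NodeOp(1, (64, 100, 200)), NodeOp(2, (64, 100, 200))],
        cache_bytes=2 * 1024 * 1024)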
Fig. 7 illustrates an electronic system 700 with which one or more implementations of the subject technology may be implemented. Electronic system 700 may be and/or may be part of electronic device 110, electronic device 115, and/or server 120 shown in fig. 1. Electronic system 700 may include various types of computer-readable media and interfaces for various other types of computer-readable media. Electronic system 700 includes a bus 708, one or more processing units 712, system memory 704 (and/or caches), ROM 710, persistent storage 702, an input device interface 714, an output device interface 706, and one or more network interfaces 716, or subsets and variations thereof.
Bus 708 generally represents all of the system bus, peripheral buses, and chipset buses that communicatively connect the many internal devices of electronic system 700. In one or more implementations, the bus 708 communicatively connects one or more processing units 712 with the ROM 710, the system memory 704, and the permanent storage device 702. One or more processing units 712 retrieve instructions to be executed and data to be processed from these various memory units in order to perform the processes of the subject disclosure. In different implementations, the one or more processing units 712 may be a single processor or a multi-core processor.
The ROM 710 stores static data and instructions for the one or more processing units 712, as well as other modules of the electronic system 700. On the other hand, persistent storage 702 may be a read-write memory device. Persistent storage 702 may be a non-volatile memory unit that stores instructions and data even when electronic system 700 is turned off. In one or more implementations, a mass storage device (such as a magnetic or optical disk and its corresponding disk drive) may be used as the persistent storage 702.
In one or more implementations, a removable storage device (such as a floppy disk, a flash drive, and their corresponding disk drives) may be used as the permanent storage device 702. Like the persistent storage device 702, the system memory 704 may be a read-write memory device. However, unlike the persistent storage device 702, the system memory 704 may be a volatile read-and-write memory, such as a random access memory. The system memory 704 may store any of the instructions and data that may be needed by the one or more processing units 712 at runtime. In one or more implementations, the processes of the subject disclosure are stored in system memory 704, permanent storage 702, and/or ROM 710. One or more processing units 712 retrieve instructions to be executed and data to be processed from these various memory units in order to perform one or more embodied processes.
The bus 708 is also connected to an input device interface 714 and an output device interface 706. The input device interface 714 enables a user to communicate information and select commands to the electronic system 700. Input devices that may be used with input device interface 714 may include, for example, alphanumeric keyboards and pointing devices (also referred to as "cursor control devices"). The output device interface 706 may, for example, enable display of images generated by the electronic system 700. Output devices that may be used with output device interface 706 may include, for example, printers and display devices, such as Liquid Crystal Displays (LCDs), Light Emitting Diode (LED) displays, Organic Light Emitting Diode (OLED) displays, flexible displays, flat panel displays, solid state displays, projectors, or any other device for outputting information. One or more implementations may include a device that acts as both an input device and an output device, such as a touch screen. In these implementations, the feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Finally, as shown in FIG. 7, the bus 708 also couples the electronic system 700 to one or more networks and/or to one or more network nodes, such as the electronic device 115 shown in FIG. 1, through one or more network interfaces 716. In this manner, electronic system 700 may be part of a computer network, such as a LAN, wide area network ("WAN"), or intranet, or may be part of a network of networks, such as the internet. Any or all of the components of electronic system 700 may be used with the subject disclosure.
One aspect of the disclosed technology may include applying machine learning and/or compiler techniques that may perform operations on user data. The present disclosure contemplates that, in some instances, the user data may include personal information data that uniquely identifies or may be used to identify a particular person. Such personal information data may include demographic data, location-based data, online identifiers, phone numbers, email addresses, home addresses, data or records related to the user's health or fitness level (e.g., vital signs measurements, medication information, exercise information), date of birth, or any other personal information.
The present disclosure recognizes that the use of such personal information data in the present technology may be useful to benefit the user. For example, personal information data may be used to perform machine learning tasks that provide results (e.g., predictions) of interest to a user. Thus, the use of such personal information data enables the user to have greater control over the results delivered. In addition, the present disclosure also contemplates other uses for which personal information data is beneficial to a user. For example, health and fitness data may be used according to a user's preferences to provide insight into their overall health status, or may be used as positive feedback to individuals using technology to pursue a health goal.
The present disclosure contemplates that entities responsible for collecting, analyzing, disclosing, transmitting, storing, or otherwise using such personal information data will comply with established privacy policies and/or privacy practices. In particular, it would be desirable for such entities to implement and consistently apply privacy practices generally recognized as meeting or exceeding industry or governmental requirements for maintaining user privacy. Such information regarding the usage of personal data should be prominently and conveniently accessible to users and should be updated as the collection and/or use of the data changes. Users' personal information should be collected for legitimate uses only. In addition, such collection should occur only after receiving user consent or on another legal basis set forth in applicable law. Furthermore, such entities should consider taking any steps necessary to safeguard and secure access to such personal information data, and to ensure that others who have access to the personal information data adhere to their privacy policies and procedures. In addition, such entities may subject themselves to third-party evaluations to demonstrate compliance with widely accepted privacy policies and practices. Further, policies and practices should be tailored to the particular types of personal information data being collected and/or accessed, and adapted to applicable laws and standards, including jurisdiction-specific considerations that may serve to impose a higher standard. For example, in the United States, the collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA), while health data in other countries may be subject to other regulations and policies and should be handled accordingly.
Regardless of the foregoing, the present disclosure also contemplates embodiments in which a user selectively blocks the use of, or access to, personal information data that a component of the system described herein may attempt to access. That is, the present disclosure contemplates that hardware elements and/or software elements may be provided to prevent or block access to such personal information data. For example, in the case of an advertisement delivery service, the present technology may be configured to allow a user to opt in to or opt out of participating in the collection of personal information data at any time during or after registration for a service. In another example, the user may choose not to provide mood-related data for a targeted content delivery service. As another example, the user may choose to limit the length of time that mood-related data is maintained, or to prohibit the development of a baseline mood profile altogether. In addition to providing "opt-in" and "opt-out" options, the present disclosure contemplates providing notifications related to accessing or using personal information. For example, the user may be notified when an application is downloaded that their personal information data will be accessed, and then reminded again just before the personal information data is accessed by the application.
Further, it is an object of the present disclosure that personal information data should be managed and processed to minimize the risk of inadvertent or unauthorized access or use. Once the data is no longer needed, the risk can be minimized by limiting data access and deleting the data. In addition, and when applicable, including in certain health-related applications, data de-identification may be used to protect the privacy of the user. De-identification may be facilitated by removing identifiers, controlling the amount or specificity of stored data (e.g., collecting location data at a city level rather than at an address level), controlling how data is stored (e.g., aggregating data among users), and/or other methods such as differential privacy, as appropriate.
Thus, while the present disclosure broadly covers the use of personal information data to implement one or more of the various disclosed embodiments, the present disclosure also contemplates that various embodiments may be implemented without the need to access such personal information data. That is, various embodiments of the present technology do not fail to function properly due to the lack of all or a portion of such personal information data. For example, content may be selected and delivered to a user based on aggregated non-personal information data or an absolute minimum amount of personal information, such as content that is processed only on the user's device or other non-personal information that may be available to a content delivery service.
Implementations within the scope of the present disclosure may be realized, in part or in whole, by a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) having one or more instructions written thereon. The tangible computer readable storage medium may also be non-transitory in nature.
A computer-readable storage medium may be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device and that includes any processing electronics and/or processing circuitry capable of executing instructions. For example, without limitation, the computer-readable medium may include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer readable medium may also include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash memory, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
Further, the computer-readable storage medium may include any non-semiconductor memory, such as optical disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium may be directly coupled to the computing device, while in other implementations, the tangible computer-readable storage medium may be indirectly coupled to the computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.
The instructions may be directly executable or may be used to develop executable instructions. For example, the instructions may be implemented as executable or non-executable machine code, or may be implemented as high-level language instructions that may be compiled to produce executable or non-executable machine code. Further, instructions may also be implemented as, or may include, data. Computer-executable instructions may also be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, and the like. As those skilled in the art will recognize, details including, but not limited to, number, structure, sequence, and organization of instructions may vary significantly without changing the underlying logic, function, processing, and output.
Although the above discussion has primarily referred to microprocessors or multi-core processors executing software, one or more implementations are performed by one or more integrated circuits, such as ASICs or FPGAs. In one or more implementations, such integrated circuits execute instructions stored on the circuit itself.
Those skilled in the art will appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. The various components and blocks may be arranged differently (e.g., arranged in a different order, or divided in a different manner) without departing from the scope of the subject technology.
It is to be understood that the specific order or hierarchy of blocks in the processes disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged or that all illustrated blocks may be performed. Any of these blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the division of various system components in the implementations described above should not be understood as requiring such division in all implementations, and it should be understood that program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
As used in this specification and any claims of this patent application, the terms "base station," "receiver," "computer," "server," "processor," and "memory" all refer to electronic or other technical devices. These terms exclude a person or group of persons. For the purposes of this specification, the term "display" or "displaying" means displaying on an electronic device.
As used herein, the phrase "at least one of" preceding a series of items, with the term "and" or "or" used to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase "at least one of" does not require the selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases "at least one of A, B, and C" or "at least one of A, B, or C" each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
The predicate words "configured to", "operable to", and "programmed to" do not imply any particular tangible or intangible modification of a subject but, rather, are intended to be used interchangeably. In one or more implementations, a processor configured to monitor and control an operation or a component may also mean the processor is programmed to monitor and control the operation, or that the processor is operable to monitor and control the operation. Likewise, a processor configured to execute code may be interpreted as a processor that is programmed to execute code or that is operable to execute code.
Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, a specific implementation, the specific implementation, another specific implementation, some specific implementation, one or more specific implementations, embodiments, the embodiment, another embodiment, some embodiments, one or more embodiments, configurations, the configuration, other configurations, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof, and the like are for convenience and do not imply that a disclosure relating to such phrase or phrases is essential to the subject technology, nor that such disclosure applies to all configurations of the subject technology. Disclosure relating to such one or more phrases may apply to all configurations or one or more configurations. Disclosure relating to such one or more phrases may provide one or more examples. Phrases such as an aspect or some aspects may refer to one or more aspects and vice versa and this applies similarly to the other preceding phrases.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" or as "exemplary" is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, to the extent that the terms "includes," has, "" having, "" has, "" with, "" has, "" having, "" contains, "" containing, "" contain.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase "means for" or, in the case of a method claim, the element is recited using the phrase "step for."
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." Unless specifically stated otherwise, the term "some" refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.

Claims (20)

1. A method, comprising:
receiving a representation of a Neural Network (NN) model to be executed on an electronic device, the representation of the NN model comprising nodes corresponding to intermediate layers of the NN model, wherein at least some of the nodes each correspond to a respective operation of a respective intermediate layer of the NN model to be executed by the electronic device;
determining, for the respective operation corresponding to each node in each respective intermediate layer of the NN model, a respective set of operations mathematically equivalent to the respective operation such that an aggregation of outputs of the respective set of operations is equivalent to an output of the respective operation;
generating a graph based on each respective set of operations, wherein the graph includes a set of branches, each branch including a plurality of operations including a particular operation from each respective set of operations;
determining a respective order for executing each branch of the graph; and
storing the graph and the corresponding order.
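Purely for illustration, and not as a description of any particular implementation of the claims, the following Python/NumPy sketch shows one way an operation of an intermediate layer (here a fully connected layer computing y = W @ x) could be decomposed into a mathematically equivalent set of operations whose aggregated outputs equal the original output; all function names are hypothetical.

    import numpy as np

    def decompose_dense_layer(W, num_branches):
        # Split the weight matrix of y = W @ x into row blocks; each block is a
        # mathematically equivalent sub-operation producing a slice of y.
        row_blocks = np.array_split(np.arange(W.shape[0]), num_branches)
        return [W[rows, :] for rows in row_blocks]

    def execute_branches(branch_weights, x):
        # Each branch yields a portion of the layer output; concatenating
        # (aggregating) the portions reproduces the original output exactly.
        return np.concatenate([Wb @ x for Wb in branch_weights])

    W = np.random.randn(64, 128)
    x = np.random.randn(128)
    branches = decompose_dense_layer(W, num_branches=4)
    assert np.allclose(execute_branches(branches, x), W @ x)

The final assertion checks the equivalence recited above: the aggregation of the outputs of the set of operations equals the output of the original operation.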
2. The method of claim 1, further comprising:
compiling a binary package for the electronic device based at least in part on the graph and the respective order for executing each branch of the graph, wherein the electronic device executes each respective set of operations based on the respective order.
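The structure of the compiled binary package is not detailed in the claims; as a loose, assumed sketch of the final clause of claim 2 only, executing the branches according to a stored order might look as follows (the example branches, order, and names are inventions of this sketch).

    import numpy as np

    def execute_in_order(branch_weights, x, order):
        # Run one branch at a time in the stored order; each partial output is
        # produced independently and then aggregated into the full output.
        parts = [None] * len(branch_weights)
        for i in order:
            parts[i] = branch_weights[i] @ x
        return np.concatenate(parts)

    # two hypothetical branches of a 4x3 layer, executed in reverse order
    branches = [np.ones((2, 3)), np.full((2, 3), 2.0)]
    x = np.array([1.0, 1.0, 1.0])
    assert np.allclose(execute_in_order(branches, x, order=[1, 0]),
                       np.array([3.0, 3.0, 6.0, 6.0]))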
3. The method of claim 1, wherein determining the respective set of operations for the respective operation corresponding to each node in each respective intermediate layer of the NN model further comprises:
determining a first plurality of operations for the respective operation;
determining a second plurality of operations for the respective operation; and
selecting one of the first plurality of operations or the second plurality of operations based at least in part on an analysis of the first plurality of operations and the second plurality of operations, wherein the analysis indicates which of the first plurality of operations and the second plurality of operations utilizes fewer memory resources, and wherein selecting one of the first plurality of operations or the second plurality of operations is further based on statistics and a set of heuristics that utilize the statistics, the statistics indicating computational overhead and memory accesses.
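As an assumed illustration of the selection step in claim 3, two candidate decompositions could be compared using simple statistics (peak partial-output size and approximate memory accesses) and a heuristic that prefers the smaller memory footprint; the particular statistics and tie-break rule below are assumptions of this sketch, not of the claim.

    def candidate_statistics(candidate, bytes_per_elem=4):
        # A candidate decomposition is given as a list of (output_rows, input_cols)
        # shapes; return its peak partial-output footprint and an approximate
        # count of weight-memory accesses.
        peak_output_bytes = max(rows for rows, _ in candidate) * bytes_per_elem
        weight_accesses = sum(rows * cols for rows, cols in candidate)
        return peak_output_bytes, weight_accesses

    def select_candidate(first, second):
        # Prefer the candidate with the smaller peak partial output, breaking
        # ties on the estimated number of memory accesses.
        return min((first, second), key=candidate_statistics)

    # two hypothetical decompositions of a 64x128 layer: 2 blocks vs. 4 blocks
    chosen = select_candidate([(32, 128)] * 2, [(16, 128)] * 4)  # picks the 4-block split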
4. The method of claim 1, wherein the output from each operation of the respective set of operations is constrained based at least in part on an amount of available memory in a cache of the electronic device.
5. The method of claim 4, wherein the aggregation of the outputs of the respective set of operations is stored in a memory of the electronic device, the memory being a slower memory than the cache of the electronic device.
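As a rough, assumed illustration of the cache constraint in claims 4 and 5, the number of branches could be chosen so that each branch's partial output fits within a cache budget, with the aggregated output residing in slower memory such as DRAM; the sizing rule below is a simplification introduced for this sketch.

    def num_branches_for_cache(output_elems, cache_budget_bytes, bytes_per_elem=4):
        # Choose enough branches that each branch's partial output fits within
        # the cache budget; the aggregated full output then lives in DRAM.
        output_bytes = output_elems * bytes_per_elem
        return max(1, -(-output_bytes // cache_budget_bytes))  # ceiling division

    # a 4096-element layer output (16 KiB of float32) against an 8 KiB budget
    assert num_branches_for_cache(4096, 8 * 1024) == 2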
6. The method of claim 1, wherein the plurality of operations of each branch begins after an input node of the NN model and ends before an output node of an output layer of the NN model.
7. The method of claim 1, wherein the plurality of operations of each branch provides a portion of an output of the NN model from an output layer.
8. The method of claim 7, wherein an aggregation of each output of each branch is equal to the output of the NN model from the output layer.
9. The method of claim 8, wherein the output of the NN model from the output layer is stored in a Dynamic Random Access Memory (DRAM).
10. The method of claim 1, wherein the electronic device comprises cache memory and Dynamic Random Access Memory (DRAM).
11. A system, comprising:
a processor;
a memory device including instructions that, when executed by the processor, cause the processor to:
receiving a representation of a Neural Network (NN) model to be executed on an electronic device, the representation of the NN model comprising nodes corresponding to layers of the NN model, wherein at least one of the nodes corresponds to an operation of a corresponding layer of the NN model to be executed by the electronic device;
determining a set of operations mathematically equivalent to the operation such that an aggregation of the outputs of the set of operations is equivalent to the output of the operation;
generating a graph based on the set of operations, wherein the graph includes a set of branches, each branch including a plurality of operations including at least one operation from the set of operations;
determining a respective order for executing each branch of the graph; and
storing the graph and the respective order for executing each branch of the graph for compilation of the NN model.
12. The system of claim 11, wherein the memory device further includes instructions that, when executed by the processor, further cause the processor to:
compiling a binary package for the electronic device based at least in part on the graph and the respective order for executing each branch of the graph, wherein the electronic device executes each operation of the respective plurality of operations based on the respective order.
13. The system of claim 11, wherein, to determine the set of operations mathematically equivalent to the operation such that the aggregation of the outputs of the set of operations is equivalent to the output of the operation, the processor is further caused to:
determining a first plurality of operations for the operation;
determining a second plurality of operations for the operation; and
selecting one of the first plurality of operations or the second plurality of operations based at least in part on an analysis of the first plurality of operations and the second plurality of operations, wherein the analysis indicates which of the first plurality of operations and the second plurality of operations utilizes fewer memory resources, and wherein selecting one of the first plurality of operations or the second plurality of operations is further based on statistics and a set of heuristics that utilize the statistics, the statistics indicating computational overhead and memory accesses.
14. The system of claim 11, wherein the output from each operation of the set of operations is constrained based at least in part on an amount of available memory in a cache of the electronic device.
15. The system of claim 14, wherein the aggregation of the outputs of the set of operations is stored in a memory of the electronic device, the memory being a slower memory than the cache of the electronic device.
16. The system of claim 11, wherein the set of operations starts after input nodes of the NN model and ends before output nodes of an output layer of the NN model.
17. The system of claim 11, wherein the set of operations for each branch provides a portion of the output of the NN model from an output layer.
18. The system of claim 17, wherein an aggregation of each output of each branch is equal to the output of the NN model from the output layer.
19. The system of claim 18, wherein the output of the NN model from the output layer is stored in DRAM.
20. A non-transitory computer-readable medium comprising instructions that, when executed by a computing device, cause the computing device to perform operations comprising:
receiving a representation of a Neural Network (NN) model to be executed on an electronic device, the representation of the NN model comprising nodes corresponding to layers of the NN model, wherein at least some of the nodes each correspond to a respective operation of a respective layer of the NN model to be executed by the electronic device;
determining, for the respective operation corresponding to each node in each layer of the NN model, a respective set of operations mathematically equivalent to the respective operation such that an aggregation of outputs of the respective set of operations is equivalent to an output of the respective operation;
generating a graph based on each respective set of operations, wherein the graph includes a set of branches, each branch including a plurality of operations including a particular operation from each respective set of operations; and
determining a respective order for executing each branch of the graph.
CN202010374389.1A 2019-05-31 2020-05-06 Decomposition of machine learning operations Active CN112016681B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962855850P 2019-05-31 2019-05-31
US62/855,850 2019-05-31
US16/601,507 US11687789B2 (en) 2019-05-31 2019-10-14 Decomposition of machine learning operations
US16/601,507 2019-10-14

Publications (2)

Publication Number Publication Date
CN112016681A 2020-12-01
CN112016681B CN112016681B (en) 2024-04-30

Family

ID=73506773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010374389.1A Active CN112016681B (en) 2019-05-31 2020-05-06 Decomposition of machine learning operations

Country Status (1)

Country Link
CN (1) CN112016681B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491469A (en) * 2016-06-11 2017-12-19 苹果公司 Intelligent task is found
CN108229646A (en) * 2017-08-08 2018-06-29 北京市商汤科技开发有限公司 neural network model compression method, device, storage medium and electronic equipment
CN108351805A (en) * 2015-10-28 2018-07-31 谷歌有限责任公司 Calculate the accelerator processing based on stream of figure
US20180293492A1 (en) * 2017-04-10 2018-10-11 Intel Corporation Abstraction library to enable scalable distributed machine learning
US20180322387A1 (en) * 2017-05-05 2018-11-08 Intel Corporation Hardware implemented point to point communication primitives for machine learning
US20190156185A1 (en) * 2017-11-21 2019-05-23 Nanjing Horizon Robotics Technology Co., Ltd. Method and apparatus for adapting feature data in a convolutional neural network

Also Published As

Publication number Publication date
CN112016681B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
US11175898B2 (en) Compiling code for a machine learning model for execution on a specialized processor
US11100391B2 (en) Power-efficient deep neural network module configured for executing a layer descriptor list
US20210374605A1 (en) System and Method for Federated Learning with Local Differential Privacy
US20210398015A1 (en) Machine learning model compiler
Iandola et al. Small neural nets are beautiful: enabling embedded systems with small deep-neural-network architectures
US11836635B2 (en) Mutable parameters for machine learning models during runtime
Qian et al. Tweedie’s compound Poisson model with grouped elastic net
US20210358127A1 (en) Interactive image segmentation
US11580444B2 (en) Data visualization machine learning model performance
US11625580B2 (en) Neural network wiring discovery
US20210398020A1 (en) Machine learning model training checkpoints
US20210397596A1 (en) Lookup table activation functions for neural networks
CN112015424A (en) Compiling code for a machine learning model for execution on a special purpose processor
US20220292781A1 (en) Generative scene networks
KR20220066882A (en) Co-information generative adversarial networks for efficient data co-clustering
EP3446258B1 (en) Model-free control for reinforcement learning agents
US20200387776A1 (en) Butterfly transform layer
US11687789B2 (en) Decomposition of machine learning operations
US20190079962A1 (en) Providing a compact representation of tree structures
CN112016681B (en) Decomposition of machine learning operations
Kim et al. Sharing leaky-integrate-and-fire neurons for memory-efficient spiking neural networks
US11080200B2 (en) Allocation of machine learning tasks into a shared cache
WO2023050143A1 (en) Recommendation model training method and apparatus
CN112016668A (en) Variable parameters of machine learning model during runtime
CN112015675B (en) Allocation of machine learning tasks into shared caches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant