WO2019128248A1 - A signal processing method and apparatus (一种信号处理方法及装置) - Google Patents

A signal processing method and apparatus

Info

Publication number
WO2019128248A1
Authority
WO
WIPO (PCT)
Prior art keywords: matrix, fractal, signal, weight, matrices
Application number
PCT/CN2018/099733
Other languages
English (en)
French (fr)
Inventor
许若圣
陈静炜
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP18896420.9A priority Critical patent/EP3663938B1/en
Publication of WO2019128248A1 publication Critical patent/WO2019128248A1/zh
Priority to US16/819,976 priority patent/US20200218777A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/544: Buffers; Shared memory; Pipes
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation of neural networks using electronic means
    • G06N 3/08: Learning methods

Definitions

  • the embodiments of the present invention relate to the field of computer technologies, and in particular, to a signal processing method and apparatus.
  • A neural network is a network structure that mimics the behavioral characteristics of animal neural networks to process information. The structure is composed of a large number of interconnected nodes (or neurons), and achieves the purpose of processing information by learning and training on the input information according to a specific operation model.
  • A neural network includes an input layer, a hidden layer, and an output layer. The input layer receives the input signal, the output layer outputs the computation result of the neural network, and the hidden layer performs learning and training and serves as the memory unit of the network. The memory function of the hidden layer is characterized by a weight matrix, in which each element typically corresponds to one weighting coefficient.
  • A convolutional neural network (CNN) is a multi-layer neural network in which each layer consists of multiple two-dimensional planes and each plane consists of multiple independent neurons. The neurons of a plane share weights, and weight sharing reduces the number of parameters in the neural network.
  • When performing a convolution operation, the processor usually converts the convolution of the input signal features and the weights into a matrix multiplication between a signal matrix and a weight matrix.
  • A is a signal matrix
  • B is a weight matrix
  • The processor may be unable to compute directly with large matrices such as A and B, or such a computation may be costly.
  • The matrix A can be divided into blocks A00, A01, A10, and A11, and the matrix B into blocks B00, B01, B10, and B11; the corresponding matrix C is then composed of the four blocks C00, C01, C10, and C11. The relationship between each block of C and the fractal signal and fractal weight matrices can be expressed as: C00 = A00·B00 + A01·B10, C01 = A00·B01 + A01·B11, C10 = A10·B00 + A11·B10, C11 = A10·B01 + A11·B11.
  • The calculation can exploit data multiplexing to reduce power consumption; for example, the calculations of C00 and C01 both reuse the block A00, so the amount of data read is reduced.
  • However, if the signal matrix and the weight matrix are fractalized in this fixed way, the shapes of the resulting fractal signal and fractal weight matrices are fixed, the power consumed is also fixed, and the design flexibility is insufficient.
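  • The fixed 2x2 blocking and the reuse of A00 described above can be illustrated with a short NumPy sketch (an illustration only, not the claimed implementation; the 4x4 sizes are chosen for the example):

```python
import numpy as np

# Split a 4x4 signal matrix A and weight matrix B into 2x2 blocks and
# compute C = A @ B block by block, reusing A00 for both C00 and C01.
A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0).reshape(4, 4)

def split_2x2(M):
    """Return the four 2x2 sub-blocks (fractal matrices) of a 4x4 matrix."""
    return M[:2, :2], M[:2, 2:], M[2:, :2], M[2:, 2:]

A00, A01, A10, A11 = split_2x2(A)
B00, B01, B10, B11 = split_2x2(B)

# Block-matrix multiplication: each block of C is an accumulation of
# two fractal-matrix products.
C00 = A00 @ B00 + A01 @ B10   # A00 is read here ...
C01 = A00 @ B01 + A01 @ B11   # ... and reused here (data multiplexing)
C10 = A10 @ B00 + A11 @ B10
C11 = A10 @ B01 + A11 @ B11

C = np.block([[C00, C01], [C10, C11]])
assert np.allclose(C, A @ B)  # identical to the unpartitioned product
```

  • Because C00 and C01 are computed back to back, the block A00 needs to be read only once, which is the reduction in read data mentioned above.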
  • Embodiments of the present invention provide a signal processing method and apparatus for improving flexibility of a fractal matrix.
  • A first aspect provides a signal processing method for use in a device including a processor. The method comprises: acquiring a signal matrix and a weight matrix, where the signal matrix is a two-dimensional matrix containing a plurality of computer-processable signals to be processed, the weight matrix is a two-dimensional matrix containing a plurality of weight coefficients, and the number of columns of the signal matrix equals the number of rows of the weight matrix; partitioning the signal matrix to obtain a plurality of first fractal signal matrices of X rows and H columns, and partitioning the weight matrix to obtain a plurality of first fractal weight matrices of H rows and Y columns, where each first fractal signal matrix and each first fractal weight matrix satisfies the non-approximate-square condition; and performing matrix multiplication and accumulation operations on the plurality of first fractal signal matrices and the plurality of first fractal weight matrices to obtain a plurality of matrix operation results, which are used to form the signal processing result.
  • In this way, when the processor obtains the signal matrix and the weight matrix, it partitions them into first fractal signal matrices of X rows and H columns and first fractal weight matrices of H rows and Y columns. Since each first fractal signal matrix and each first fractal weight matrix satisfies the non-approximate-square condition, the flexibility of the fractal matrices is improved, which facilitates power-consumption optimization.
  • Satisfying the non-approximate-square condition means that the absolute value of the difference between the number of rows and the number of columns of the matrix is greater than or equal to 2; that is, for both the first fractal signal matrix and the first fractal weight matrix, |rows - columns| >= 2.
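  • The non-approximate-square condition reduces to a one-line predicate (the function name below is ours, introduced only for illustration):

```python
def is_non_approximate_square(rows: int, cols: int) -> bool:
    """A fractal matrix satisfies the non-approximate-square condition
    when the absolute difference between its row and column counts is
    at least 2, i.e. the matrix is clearly 'tall' or 'wide'."""
    return abs(rows - cols) >= 2

# A 16x4 first fractal signal matrix qualifies; a 3x4 one does not.
assert is_non_approximate_square(16, 4)
assert not is_non_approximate_square(3, 4)
```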
  • In one design, the processor includes a first buffer and a second buffer. Partitioning the signal matrix into a plurality of first fractal signal matrices of X rows and H columns and the weight matrix into a plurality of first fractal weight matrices of H rows and Y columns comprises: reading, through the first buffer, the plurality of first fractal signal matrices of X rows and H columns from the signal matrix over multiple reads; and reading, through the second buffer, the plurality of first fractal weight matrices of H rows and Y columns from the weight matrix over multiple reads.
  • In this way, the processor can read non-approximate-square first fractal signal matrices through the first buffer and non-approximate-square first fractal weight matrices through the second buffer, improving the flexibility of the fractal matrices read through the two buffers.
  • In one design, the processor further includes a third buffer, and the method further comprises: writing a matrix multiplication result, or an accumulation of at least two matrix multiplication results, to the third buffer.
  • In one design, performing a matrix multiplication operation on a first fractal signal matrix and a first fractal weight matrix to obtain a matrix multiplication result comprises: partitioning the first fractal signal matrix to obtain a plurality of second fractal signal matrices of x rows and h columns, and partitioning the first fractal weight matrix to obtain a plurality of second fractal weight matrices of h rows and y columns, where each second fractal signal matrix and each second fractal weight matrix satisfies the non-approximate-square condition; and performing matrix multiplication and accumulation operations on the plurality of second fractal signal matrices and the plurality of second fractal weight matrices to obtain a plurality of matrix operation results.
  • In this way, the processor can further divide the first fractal matrices into smaller non-approximate-square second fractal signal matrices and second fractal weight matrices, and perform the matrix multiplication and accumulation operations on them, further improving the flexibility of the fractal matrices.
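  • The second-level partitioning can be sketched as follows (a NumPy illustration under assumed block sizes that divide the first fractal matrices exactly; it is not the claimed hardware procedure):

```python
import numpy as np

def blocks(M, r, c):
    """Partition matrix M into a grid of sub-blocks of r rows and c columns."""
    R, C = M.shape
    return [[M[i:i + r, j:j + c] for j in range(0, C, c)]
            for i in range(0, R, r)]

# A first fractal pair: an 8x2 signal block times a 2x8 weight block ...
A1 = np.random.rand(8, 2)
B1 = np.random.rand(2, 8)

# ... is further split into 4x1 second fractal signal matrices and 1x4
# second fractal weight matrices (both satisfy |rows - cols| >= 2), then
# recombined by matrix multiplication and accumulation.
As = blocks(A1, 4, 1)
Bs = blocks(B1, 1, 4)
C = np.block([[sum(As[i][k] @ Bs[k][j] for k in range(len(Bs)))
               for j in range(len(Bs[0]))]
              for i in range(len(As))])
assert np.allclose(C, A1 @ B1)  # same result as the first-level product
```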
  • A second aspect provides a signal processing apparatus comprising: an acquiring unit configured to acquire a signal matrix and a weight matrix, where the signal matrix is a two-dimensional matrix containing a plurality of computer-processable signals to be processed, the weight matrix is a two-dimensional matrix containing a plurality of weight coefficients, and the number of columns of the signal matrix equals the number of rows of the weight matrix; and a processing unit configured to partition the signal matrix to obtain a plurality of first fractal signal matrices of X rows and H columns, partition the weight matrix to obtain a plurality of first fractal weight matrices of H rows and Y columns, where each first fractal signal matrix and each first fractal weight matrix satisfies the non-approximate-square condition, and perform matrix multiplication and accumulation operations on the plurality of first fractal signal matrices and the plurality of first fractal weight matrices to obtain a plurality of matrix operation results, which are used to form the signal processing result.
  • Satisfying the non-approximate-square condition means that the absolute value of the difference between the number of rows and the number of columns of the matrix is greater than or equal to 2.
  • In one design, the processing unit includes a first buffer and a second buffer, and the processing unit is configured to: read, through the first buffer, the plurality of first fractal signal matrices of X rows and H columns from the signal matrix over multiple reads; and read, through the second buffer, the plurality of first fractal weight matrices of H rows and Y columns from the weight matrix over multiple reads.
  • In one design, the processing unit further includes a third buffer, and the processing unit is further configured to write a matrix multiplication result, or an accumulation of at least two matrix multiplication results, to the third buffer.
  • In one design, when performing a matrix multiplication operation on a first fractal signal matrix and a first fractal weight matrix, the processing unit is further configured to: partition the first fractal signal matrix to obtain a plurality of second fractal signal matrices of x rows and h columns, and partition the first fractal weight matrix to obtain a plurality of second fractal weight matrices of h rows and y columns, where each second fractal signal matrix and each second fractal weight matrix satisfies the non-approximate-square condition; and perform matrix multiplication and accumulation operations on the plurality of second fractal signal matrices and the plurality of second fractal weight matrices to obtain a plurality of matrix operation results.
  • A third aspect provides a signal processing apparatus comprising: an input interface for acquiring a signal matrix and a weight matrix, where the signal matrix is a two-dimensional matrix, the weight matrix is a two-dimensional matrix containing a plurality of weight coefficients, and the number of columns of the signal matrix equals the number of rows of the weight matrix; and a processor configured to partition the signal matrix to obtain a plurality of first fractal signal matrices of X rows and H columns, partition the weight matrix to obtain a plurality of first fractal weight matrices of H rows and Y columns, where each first fractal signal matrix and each first fractal weight matrix satisfies the non-approximate-square condition, and perform matrix multiplication and accumulation operations on the plurality of first fractal signal matrices and the plurality of first fractal weight matrices to obtain a plurality of matrix operation results, which are used to form the signal processing result.
  • satisfying the non-approximate square includes: the absolute value of the difference between the number of rows and the number of columns of the matrix is greater than or equal to 2.
  • In one design, the processor includes a first buffer and a second buffer, and the processor further: reads, through the first buffer, the plurality of first fractal signal matrices of X rows and H columns from the signal matrix over multiple reads; and reads, through the second buffer, the plurality of first fractal weight matrices of H rows and Y columns from the weight matrix over multiple reads.
  • the processor further includes a third buffer, and the processor further performs: writing a matrix multiplication result or an accumulation of at least two matrix multiplication results to the third buffer.
  • In one design, when performing a matrix multiplication operation on a first fractal signal matrix and a first fractal weight matrix, the processor further: partitions the first fractal signal matrix to obtain a plurality of second fractal signal matrices of x rows and h columns, and partitions the first fractal weight matrix to obtain a plurality of second fractal weight matrices of h rows and y columns, where each second fractal signal matrix and each second fractal weight matrix satisfies the non-approximate-square condition; and performs matrix multiplication and accumulation operations on the plurality of second fractal signal matrices and the plurality of second fractal weight matrices to obtain a plurality of matrix operation results.
  • A still further aspect of the present application provides a computer-readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the signal processing method provided by the first aspect or any of its possible implementations.
  • A further aspect provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the signal processing method provided by the first aspect or any of its possible implementations.
  • A further aspect provides a processor configured to: acquire a signal matrix and a weight matrix, where the signal matrix is a two-dimensional matrix containing a plurality of computer-processable signals to be processed, the weight matrix is a two-dimensional matrix containing a plurality of weight coefficients, and the number of columns of the signal matrix equals the number of rows of the weight matrix; partition the signal matrix to obtain a plurality of first fractal signal matrices of X rows and H columns, and partition the weight matrix to obtain a plurality of first fractal weight matrices of H rows and Y columns, where each first fractal signal matrix and each first fractal weight matrix satisfies the non-approximate-square condition; and perform matrix multiplication and accumulation operations on the plurality of first fractal signal matrices and the plurality of first fractal weight matrices to obtain a plurality of matrix operation results, which are used to form the signal processing result.
  • Satisfying the non-approximate-square condition means that the absolute value of the difference between the number of rows and the number of columns of the matrix is greater than or equal to 2.
  • In one design, the processor includes a first buffer and a second buffer, and the processor further: reads, through the first buffer, the plurality of first fractal signal matrices of X rows and H columns from the signal matrix over multiple reads; and reads, through the second buffer, the plurality of first fractal weight matrices of H rows and Y columns from the weight matrix over multiple reads.
  • In one design, the processor further includes a third buffer, and the processor further writes a matrix multiplication result, or an accumulation of at least two matrix multiplication results, to the third buffer.
  • In one design, the processor further: partitions the first fractal signal matrix to obtain a plurality of second fractal signal matrices of x rows and h columns, and partitions the first fractal weight matrix to obtain a plurality of second fractal weight matrices of h rows and y columns, where each second fractal signal matrix and each second fractal weight matrix satisfies the non-approximate-square condition; and performs matrix multiplication and accumulation operations on the plurality of second fractal signal matrices and the plurality of second fractal weight matrices to obtain a plurality of matrix operation results.
  • The processor includes a computing unit for performing the computational processing described above. The computing unit comprises a multiply-accumulate (MAC) unit, which is hardware for performing multiply-accumulate operations.
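  • The primitive performed by such a unit is acc <- acc + a * b; a dot product of one fractal-matrix row with one column is simply a chain of these steps (a software sketch of the operation, not a hardware description):

```python
def multiply_accumulate(acc, a, b):
    """One MAC step: return acc + a * b, the operation a hardware
    multiply-accumulate unit performs in a single cycle."""
    return acc + a * b

# Dot product of a fractal-matrix row and a fractal-matrix column
# expressed as a chain of multiply-accumulate steps.
row = [1.0, 2.0, 3.0]
col = [4.0, 5.0, 6.0]
acc = 0.0
for a, b in zip(row, col):
    acc = multiply_accumulate(acc, a, b)
assert acc == 32.0  # 1*4 + 2*5 + 3*6
```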
  • Figure 1 is a schematic diagram of a matrix block
  • FIG. 2 is a schematic structural diagram of a device according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a neural network according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a fully connected neural network according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a convolution operation according to an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart diagram of a signal processing method according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of a matrix block according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a first fractal signal matrix according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of a processor according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of another processor according to an embodiment of the present disclosure.
  • FIG. 12 is a schematic structural diagram of a signal processing apparatus according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a device according to an embodiment of the present disclosure.
  • the device may include a memory 201, a processor 202, a communication interface 203, and a bus 204.
  • the memory 201, the processor 202, and the communication interface 203 are connected to one another via a bus 204.
  • The memory 201 can be used for storing data, software programs, and modules, and mainly includes a storage program area and a storage data area. The storage program area can store an operating system, an application required for at least one function, and the like; the storage data area can store data created during use of the device, and the like.
  • The processor 202 is configured to control the operation of the device, for example by running or executing software programs and/or modules stored in the memory 201 and calling data stored in the memory 201, so as to perform the various functions of the device and to process data.
  • the communication interface 203 is used to support the device for communication.
  • the processor 202 can include a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It is possible to implement or carry out the various illustrative logical blocks, modules and circuits described in connection with the present disclosure.
  • the processor may also be a combination of computing functions, for example, including one or more microprocessor combinations, combinations of digital signal processors and microprocessors, and the like.
  • the bus 204 can be a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus or the like.
  • As shown in FIG. 3, it is a schematic structural diagram of a neural network 300 having N processing layers, where N ≥ 3 and N is a natural number. The first layer of the neural network is the input layer 301, which receives the input signal; the last layer is the output layer 303, which outputs the processing result of the neural network. The layers other than the first and last layers are intermediate layers 304, which together form the hidden layer 302. Each intermediate layer in the hidden layer can both receive an input signal and output a signal, and the hidden layer is responsible for processing the input signal.
  • Each layer represents a logical level of signal processing, through which multiple layers of data can be processed by multiple levels of logic.
  • The processing function may be a rectified linear unit (ReLU), a hyperbolic tangent function (tanh), a sigmoid function, or the like.
  • (x1, x2, x3) is a one-dimensional input signal matrix, (h1, h2, h3) is the output signal matrix, W_ij represents the weight coefficient between the input x_j and the output h_i, and the matrix formed by the weight coefficients is the weight matrix.
  • The weight matrix W corresponding to the one-dimensional signal matrix and the output signal matrix is as shown in formula (1):

    W = [ W_11 W_12 W_13 ; W_21 W_22 W_23 ; W_31 W_32 W_33 ]        (1)

  • The relationship between the input signal and the output signal is as shown in formula (2), where b_i is the bias value of the neural network processing function f; the bias adjusts the input of the neural network so that an ideal output result can be obtained:

    h_i = f( ∑_j W_ij · x_j + b_i ),  i = 1, 2, 3        (2)
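  • Formulas (1) and (2) correspond to a single fully connected layer; a minimal NumPy version follows (the weight and bias values are example values of our own choosing, with ReLU as the processing function f):

```python
import numpy as np

# h = f(W x + b) for a 3-input, 3-output layer. W[i][j] is the weight
# between input x_j and output h_i; b is the bias vector; f is ReLU.
W = np.array([[0.2, 0.5, -0.3],
              [0.8, -0.1, 0.4],
              [-0.6, 0.9, 0.1]])
x = np.array([1.0, 2.0, 3.0])
b = np.array([0.1, 0.0, -0.2])

relu = lambda v: np.maximum(v, 0.0)  # the processing function f
h = relu(W @ x + b)                  # formula (2) in vector form

assert h.shape == (3,)
```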
  • The input signal of the neural network may be a signal in various forms, such as a voice signal, a text signal, an image signal, or a temperature signal. The voice signal may be a voice signal recorded by a recording device, a voice signal received by a mobile phone or landline telephone during a call, or a voice signal received by a radio; the text signal may be a TXT, Word, or PDF text signal; the image signal may be a landscape signal captured by a camera, an image signal of a community environment captured by a surveillance device, or a facial signal acquired by an access control system. The input signals of the neural network also include other computer-processable engineering signals, which are not enumerated here.
  • the processing performed by the hidden layer 302 of the neural network may be processing such as removing the mixed noise signal in the speech signal to enhance the speech signal, understanding the specific content in the text signal, and recognizing the facial image signal of the face.
  • Each layer of the neural network may include multiple nodes, which may also be referred to as neurons.
  • A fully connected neural network is a neural network in which the neurons of adjacent layers are fully connected, that is, every neuron in the previous layer is connected to every neuron in the following layer.
  • FIG. 4 is a schematic diagram of a three-layer fully connected neural network; layers 1 and 2 each comprise four neurons, and layer 3 comprises one neuron. "+1" in FIG. 4 represents a bias neuron, used to adjust the input of each layer in the neural network. Because the neurons in adjacent layers of a fully connected network are fully connected, when the fully connected neural network has many intermediate layers, the dimensions of the signal matrix and the weight matrix in later processing layers become very large, and the network size of the neural network becomes too large.
  • A convolutional neural network can slide a small parameter template over the spatial domain of the input signal and filter it, thereby solving the problem that the network size of a fully connected neural network is too large.
  • Convolutional neural networks differ from ordinary neural networks in that a convolutional neural network contains a feature extractor composed of a convolutional layer and a subsampling layer. In the convolutional layer of a convolutional neural network, a neuron is connected only to a portion of the neurons of the adjacent layer. A convolutional layer usually contains several feature planes, each composed of neurons arranged in a rectangle; the neurons of the same feature plane share weights, and the shared weights are the convolution kernel.
  • the convolution kernel is generally initialized in the form of a random fractional matrix, and the convolution kernel will learn to obtain reasonable weights during the training of the network.
  • the immediate benefit of convolution kernels is to reduce the connections between the various layers of the network.
  • Subsampling is also called pooling. Subsampling can be seen as a special convolution process. Convolution and subsampling greatly simplify the model complexity and reduce the parameters of the model.
  • A convolutional neural network is composed of three parts: the first part is the input layer; the second part is a combination of multiple convolutional layers and multiple pooling layers; the third part is the output layer, which may consist of a fully connected multi-layer perceptron classifier.
  • a convolutional layer in a convolutional neural network can be used to convolve the input signal array and the weight array.
  • Convolutional neural networks can be widely used in speech recognition, face recognition, general object recognition, motion analysis, image processing, and the like.
  • Taking a two-dimensional input signal as an example, as shown in FIG. 6, assume that the input features of an image at a certain convolutional layer comprise three signal matrices of three rows and three columns, and that the convolution kernel comprises six weight matrices of two rows and two columns.
  • Two specific operations for convolution operations in a convolutional neural network are shown in Figure 6, one is a conventional convolution operation and the other is a matrix-converted convolution operation.
  • The traditional convolution operation convolves each signal matrix with its corresponding weight matrices and accumulates the corresponding results to obtain two output signal matrices, that is, the output features.
  • The matrix-converted convolution operation instead transforms the different signal matrices to obtain an input feature matrix, a single matrix of larger dimensions that contains the three signal matrices; the six weight matrices are correspondingly transformed to obtain a kernel matrix, a single matrix of larger dimensions that contains the six weight matrices. The transformed input feature matrix and kernel matrix are then matrix-multiplied to obtain the output feature matrix.
  • The matrix multiplication after this transformation has a large computational cost, so the large matrices need to be partitioned into smaller fractal matrices, which are multiplied to obtain the corresponding results; that is, the multiplication of a large matrix is split into multiplications and accumulations of multiple fractal matrices.
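  • The transform in FIG. 6 is commonly implemented as an im2col-style rearrangement; the following NumPy sketch is our illustration (it uses cross-correlation, the usual CNN convention, and FIG. 6's sizes) of how the convolution becomes one matrix multiplication:

```python
import numpy as np

def conv_as_matmul(signals, kernels):
    """Convert a multi-channel 2-D convolution into one matrix multiply.
    signals: (Cin, H, W) input feature maps; kernels: (Cout, Cin, k, k).
    Returns output feature maps of shape (Cout, H-k+1, W-k+1)."""
    cin, H, W = signals.shape
    cout, _, k, _ = kernels.shape
    oh, ow = H - k + 1, W - k + 1
    # Each row of the input feature matrix is one flattened receptive field.
    feat = np.array([signals[:, i:i + k, j:j + k].ravel()
                     for i in range(oh) for j in range(ow)])  # (oh*ow, cin*k*k)
    kmat = kernels.reshape(cout, cin * k * k).T               # (cin*k*k, cout)
    return (feat @ kmat).T.reshape(cout, oh, ow)

# FIG. 6's sizes: three 3x3 signal matrices and six 2x2 weight matrices
# grouped into two output channels give two 2x2 output feature maps.
sig = np.random.rand(3, 3, 3)
ker = np.random.rand(2, 3, 2, 2)
out = conv_as_matmul(sig, ker)
assert out.shape == (2, 2, 2)
```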
  • The signal processing method can be performed in any intermediate layer of the hidden layer of the neural network.
  • The neural network may be a fully connected neural network, in which case the intermediate layer may also be referred to as a fully connected layer; or the neural network may be a convolutional neural network, in which case the processing performed in the intermediate layer may specifically be the processing in a convolutional layer of the convolutional neural network.
  • FIG. 7 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure.
  • The execution body of the method may be a device, specifically a unit with a computing function in the device, such as a neural network processor. The method includes the following steps.
  • Step 701 Acquire a signal matrix and a weight matrix, and the number of columns of the signal matrix is equal to the number of rows of the weight matrix.
  • The signal matrix may come from the input layer of the neural network or from the layer above the intermediate layer where the signal processing takes place. The input signal may be a voice signal, a text signal, an image signal, a temperature signal, or the like, and may be collected and processed. The signal matrix may be a matrix that has not undergone matrix conversion, or a matrix obtained by matrix conversion.
  • The signal matrix may be a two-dimensional matrix of M rows and K columns, and the matrix includes a plurality of computer-processable signals to be processed; that is, each element corresponds to one signal.
  • the matrix before the signal matrix conversion may be a one-dimensional column vector, a one-dimensional row vector, a two-dimensional matrix (such as a grayscale image), and a three-dimensional matrix (such as an RGB color image).
  • the embodiment of the present application does not specifically limit this.
  • The weight matrix is composed of weight coefficients, which may be defined by the neural network. A weight coefficient acts on the input signal: during the learning and training of the neural network, the input signal corresponding to a large weight coefficient is strengthened, and the input signal corresponding to a small weight coefficient is weakened.
  • The weight matrix may be a weight matrix that has not undergone matrix conversion, or a weight matrix obtained by matrix conversion; it is a two-dimensional weight matrix of K rows and N columns.
  • Step 702: Partition the signal matrix to obtain a plurality of first fractal signal matrices of X rows and H columns, and partition the weight matrix to obtain a plurality of first fractal weight matrices of H rows and Y columns. The plurality of first fractal signal matrices and the plurality of first fractal weight matrices have a correspondence relationship, and each first fractal signal matrix and each first fractal weight matrix satisfies the non-approximate-square condition.
  • The processor cannot directly operate on matrices of large dimensions, so the signal matrix and the weight matrix need to be partitioned separately; partitioning a matrix means dividing it into a plurality of sub-blocks, each of which may be referred to as a fractal matrix.
  • The number of first fractal signal matrices obtained by partitioning may or may not equal the number of first fractal weight matrices, and the plurality of first fractal signal matrices and the plurality of first fractal weight matrices have a correspondence.
  • The correspondence may be one-to-many, many-to-one, or many-to-many; that is, one first fractal signal matrix may correspond to a plurality of first fractal weight matrices, a plurality of first fractal signal matrices may correspond to one first fractal weight matrix, or a plurality of first fractal signal matrices may correspond to a plurality of first fractal weight matrices.
  • The number of columns of each first fractal signal matrix and the number of rows of each first fractal weight matrix are both H; that is, a first fractal signal matrix and its corresponding first fractal weight matrix satisfy the matrix multiplication rule, which requires that the number of columns of the first matrix participating in the multiplication equal the number of rows of the second matrix.
  • X, H, and Y are related to the numbers of rows and columns of the signal matrix and of the weight matrix, and each first fractal signal matrix and each first fractal weight matrix is a non-approximate square.
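The partitioning of a matrix into fractal sub-blocks can be sketched as follows (a simplified model that assumes the fractal dimensions divide the matrix evenly, which the embodiment does not require; the function name is illustrative):

```python
def partition(mat, rows, cols):
    """Split a 2-D list `mat` into sub-blocks (fractal matrices) of size
    `rows` x `cols`, indexed as blocks[row_block][col_block]."""
    M, K = len(mat), len(mat[0])
    return [[[r[c:c + cols] for r in mat[i:i + rows]]
             for c in range(0, K, cols)]
            for i in range(0, M, rows)]

# A 4 x 8 signal matrix split into 2 x 4 first fractal signal matrices;
# |2 - 4| >= 2, so each fractal is a non-approximate square.
A = [[i * 8 + j for j in range(8)] for i in range(4)]
blocks = partition(A, 2, 4)
```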
  • FIG. 8 is a schematic diagram of matrix partitioning.
  • the signal matrix is A
  • the weight matrix is B
  • the signal matrix A is divided into four first fractal signal matrices, which are respectively represented as A00, A01, A10, and A11
  • the weight matrix B is divided into four first fractal weight matrices, which are respectively represented as B00 and B01. B10 and B11.
  • The first fractal weight matrices corresponding to A00 and A10 are B00 and B01.
  • The first fractal weight matrices corresponding to A01 and A11 are B10 and B11.
  • the matrix C may be composed of four matrices C00, C01, C10, and C11, and the relationship between each constituent matrix in the matrix C and the first fractal signal matrix and the first fractal weight matrix may be as shown in the following formula (4).
  • Each matrix block of matrix C in formula (4) can be computed in two steps. Taking C00 as an example, the following steps (I)-(II) can be performed; by reusing the intermediate result C00_temp, the amount of data read and written can be reduced, lowering the processor's bandwidth requirement while saving memory read/write power.
  • Similarly, when the processor performs the following steps (i)-(ii), it only needs to fetch A00 once; by multiplexing the data A00, read/write power consumption between the processor and memory can be saved.
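The two-step computation of C00 with reuse of the intermediate result C00_temp can be sketched in plain Python (a model of the arithmetic only, not of the processor's actual datapath; the small example matrices are illustrative):

```python
def matmul(a, b):
    """Plain row-by-column matrix multiplication."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def madd(a, b):
    """Element-wise accumulation of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

A00, A01 = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
B00, B10 = [[1, 0], [0, 1]], [[1, 1], [1, 1]]

# Step (I): C00_temp = A00 * B00 (intermediate kept, as in the buffer).
C00_temp = matmul(A00, B00)
# Step (II): C00 = C00_temp + A01 * B10 (C00_temp reused, not re-fetched).
C00 = madd(C00_temp, matmul(A01, B10))
```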
  • When the processor computes the matrix multiplication of the signal matrix and the weight matrix, the signal matrix is divided into a plurality of first fractal signal matrices of X rows and H columns, and the weight matrix is divided into a plurality of first fractal weight matrices of H rows and Y columns.
  • Each first fractal signal matrix and each first fractal weight matrix is a non-approximate square, which improves the flexibility of the fractal matrices; further, based on the plurality of first fractal signal matrices and the plurality of first fractal weight matrices,
  • the processor can optimize the matrix multiplication of the signal matrix and the weight matrix according to the read/write power consumption of different data.
  • Satisfying a non-approximate square means that the absolute value of the difference between a matrix's number of rows and number of columns is greater than or equal to 2; that is, the number of rows X and the number of columns H of the first fractal signal matrix satisfy |X - H| >= 2.
  • Likewise, the first fractal weight matrix may be a rectangular matrix whose difference between the number of rows and the number of columns is greater than or equal to 2, that is, one that does not satisfy an approximate square.
  • the signal matrix A is an M ⁇ K matrix
  • the weight matrix B is a K ⁇ N matrix
  • X, H, and Y are related to M, K, and N
  • The number of rows X and the number of columns H of the first fractal signal matrix may be as shown in FIG. 9.
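The non-approximate-square condition stated above (absolute row/column difference of at least 2) can be expressed as a simple predicate:

```python
def non_approximate_square(rows, cols):
    """A fractal matrix is a non-approximate square when the absolute
    difference between its row and column counts is at least 2."""
    return abs(rows - cols) >= 2

q1 = non_approximate_square(2, 4)   # X=2, H=4: qualifies
q2 = non_approximate_square(3, 4)   # |3 - 4| < 2: approximately square, excluded
```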
  • Step 703: Perform matrix multiplication and accumulation operations on the plurality of first fractal signal matrices and the plurality of first fractal weight matrices to obtain a plurality of matrix operation results, where the plurality of matrix operation results are used to form the signal processing result. Each matrix
  • operation result is the accumulation of a plurality of matrix multiplication results, and each matrix multiplication result is obtained by multiplying one first fractal signal matrix by one first fractal weight matrix.
  • The number of first fractal signal matrices and the number of first fractal weight matrices may or may not be equal.
  • A first fractal signal matrix is matrix-multiplied by a corresponding first fractal weight matrix to obtain one matrix multiplication result.
  • The plurality of first fractal signal matrices and the plurality of first fractal weight matrices are multiplied and accumulated according to the correspondence between them.
  • The above calculation yields an output matrix comprising a plurality of matrix operation results; a matrix operation result is the accumulation of a plurality of matrix multiplication results, and each matrix operation result may comprise a plurality of computer-processable output signals.
  • If each first fractal signal matrix and each first fractal weight matrix is treated as a single element, the multiply-accumulate operation over the plurality of first fractal signal matrices and first fractal weight matrices is analogous to multiplying two ordinary matrices whose entries are those elements.
  • the matrix C may be referred to as an output matrix
  • C00, C01, C10, and C11 are referred to as matrix operation results
  • the output matrix C includes four matrix operation results.
  • the product of A00 and B00 is a matrix multiplication result
  • the product of A01 and B10 is also a matrix multiplication result.
  • The two matrix multiplication results correspond to the position of C00 in the output matrix C, so the accumulation of the two matrix multiplication results is called one matrix operation result.
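Treating each fractal as one element, the multiply-accumulate over fractal matrices sketched above can be modeled as follows (1x1 blocks are used purely to keep the example small; the function names are illustrative):

```python
def matmul(a, b):
    """Plain row-by-column matrix multiplication."""
    return [[sum(x * y for x, y in zip(r, c)) for c in zip(*b)] for r in a]

def block_matmul(A_blocks, B_blocks):
    """Multiply-accumulate over fractal matrices: each output block C[i][j]
    is the accumulation of the matrix multiplication results
    A[i][k] * B[k][j] over k."""
    n, m, p = len(A_blocks), len(A_blocks[0]), len(B_blocks[0])
    C = [[None] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = None
            for k in range(m):
                prod = matmul(A_blocks[i][k], B_blocks[k][j])
                acc = prod if acc is None else [
                    [x + y for x, y in zip(ra, rb)]
                    for ra, rb in zip(acc, prod)]
            C[i][j] = acc
    return C

# 2x2 grids of 1x1 fractals: C[0][0] = A00*B00 + A01*B10, etc.
A_blocks = [[[[1]], [[2]]], [[[3]], [[4]]]]
B_blocks = [[[[5]], [[6]]], [[[7]], [[8]]]]
C = block_matmul(A_blocks, B_blocks)
```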
  • Step 704 Output a signal processing result, where the signal processing result includes a plurality of the matrix operation results.
  • the processor may further output a signal processing result, the signal processing result including a plurality of matrix operation results.
  • The output matrix composed of the plurality of matrix operation results may be a two-dimensional matrix (for example, a grayscale image), and the output signal corresponding to the output matrix may be a voice, text, image, or temperature signal corresponding to the input signal.
  • The signal processing result may be passed to the next middle layer after the middle layer in which the signal processing is performed, or to the output layer of the neural network.
  • The processor may include a multiply-accumulator (MAC) unit, a first buffer, a second buffer, and a third buffer, where the MAC unit may interact directly with the first buffer,
  • the second buffer, and the third buffer. The processor may further include a fourth buffer, which is connected with the first buffer and the second buffer; the MAC unit may interact with the fourth buffer through the first buffer, the second buffer, and the third buffer.
  • The MAC (multiply-accumulator) unit is configured to perform the specific multiply-add operations.
  • The fourth buffer is configured to store the signal matrix and the weight matrix.
  • The first buffer is configured to store a first fractal signal matrix of the signal matrix.
  • The second buffer is configured to store a first fractal weight matrix of the weight matrix.
  • The third buffer is used to store a matrix multiplication result, or the accumulation of at least two matrix multiplication results; the accumulation of the at least two matrix multiplication results may be a matrix operation result.
  • Each unit in the above processor may be circuit hardware, including but not limited to one or more transistors, logic gates, or basic arithmetic units.
  • The signal matrix and the weight matrix may be matrices generated by earlier calculations in the processor, or may come from devices other than the processor, such as hardware accelerators or other processors.
  • the processor of this embodiment is for acquiring a signal matrix and a weight matrix and performing calculations according to the method of the previous embodiment.
  • the specific operation process of the processor in FIG. 10 can refer to the previous method embodiment.
  • Partitioning the signal matrix to obtain a plurality of first fractal signal matrices of X rows and H columns includes:
  • the MAC unit reads, through the first buffer, a first fractal signal matrix of X rows and H columns from the signal matrix multiple times.
  • That is, the processor may read a first fractal signal matrix of X rows and H columns from the fourth buffer and store it in the first buffer.
  • Partitioning the weight matrix to obtain a plurality of first fractal weight matrices of H rows and Y columns includes:
  • the MAC unit reads, through the second buffer, a first fractal weight matrix of H rows and Y columns from the weight matrix multiple times.
  • That is, the processor may read a first fractal weight matrix of H rows and Y columns from the fourth buffer and store it in the second buffer.
  • X is positively correlated with the first read/write power consumption of the first buffer, Y is positively correlated with the first read/write power consumption of the second buffer, and H is inversely correlated with the first read/write power consumption of both the first buffer and the second buffer.
  • The positive correlation between X and the first read/write power consumption of the first buffer means that the larger X is, the greater the first read/write power consumption of the first buffer, and the smaller X is, the lower it is; for example, X may be proportional to the first read/write power consumption of the first buffer.
  • Likewise, the positive correlation between Y and the first read/write power consumption of the second buffer means that the larger Y is, the greater the first read/write power consumption of the second buffer, and the smaller Y is, the lower it is; for example, Y may be proportional to the first read/write power consumption of the second buffer.
  • The inverse correlation between H and the first read/write power consumption of the first and second buffers means that the larger H is, the smaller those power consumptions, and the smaller H is, the greater they are; for example, H may be inversely proportional to the first read/write power consumption of the first buffer and the second buffer.
  • Taking the multiplication of one first fractal signal matrix by one first fractal weight matrix as an example, in which the MAC unit reads a first fractal signal matrix of X rows and H columns and a first fractal weight matrix of H rows and Y columns,
  • the relationship between X, Y, H and the read/write power consumption of the first buffer and the second buffer is described in detail below.
  • When the MAC unit reads the first fractal signal matrix, it first reads a first fractal signal matrix of X rows and H columns from the signal matrix stored in the fourth buffer;
  • that is, the first fractal signal matrix read from the fourth buffer is written into the first buffer, and then the first fractal signal matrix is read from the first buffer.
  • Similarly, when the MAC unit reads the first fractal weight matrix, it first reads a first fractal weight matrix of H rows and Y columns from the weight matrix stored in the fourth buffer through the second buffer;
  • that is, the first fractal weight matrix read from the fourth buffer is written into the second buffer, and then the first fractal weight matrix is read from the second buffer.
  • When the MAC unit performs the matrix multiplication, each column of the first fractal weight matrix is multiplied by each row of the first fractal signal matrix, so the MAC unit needs to read the X rows of the first fractal signal matrix from the first buffer through X read operations, and the Y columns of the first fractal weight matrix from the second buffer through Y read operations. It can be seen that the larger X is, the more read operations the first buffer performs, and thus the greater its first read/write power consumption; the smaller X is, the fewer the read operations,
  • and the smaller the first read/write power consumption of the first buffer, so X is positively correlated with the first read/write power consumption of the first buffer.
  • Likewise, the larger Y is, the more read operations the second buffer performs and the greater its first read/write power consumption; the smaller Y is, the fewer the read operations and the smaller that power consumption, so Y is positively correlated with the first read/write power consumption of the second buffer.
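A hypothetical counting model of the relations just described (the function name, and the choice of counting only the read operations accumulated along the K dimension of one output block, are illustrative assumptions rather than details from the embodiment):

```python
def first_level_read_ops(X, H, Y, K):
    """Sketch: computing one output block consumes K // H fractal pairs along
    the K dimension; each pair costs X row reads of the first buffer and
    Y column reads of the second buffer. The totals X*(K//H) and Y*(K//H)
    rise with X (resp. Y) and fall as H grows, matching the stated
    positive/inverse correlations."""
    pairs = K // H
    return X * pairs, Y * pairs

reads_first, reads_second = first_level_read_ops(2, 4, 3, 8)
```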
  • The method may further include: writing to the third buffer a matrix multiplication result or the accumulation of at least two matrix multiplication results; and/or reading from the third buffer a matrix multiplication result or the accumulation of at least two matrix multiplication results.
  • X and Y are inversely related to the read/write power consumption of the third buffer, respectively, and H is positively correlated with the read/write power consumption of the third buffer.
  • The inverse correlation between X, Y and the read/write power consumption of the third buffer means that the larger X and Y are, the smaller the read/write power consumption of the third buffer, and the smaller X and Y are, the greater it is; for example, X and Y may each be inversely proportional to the read/write power consumption of the third buffer.
  • Taking the multiplication in the MAC unit of one first fractal signal matrix of X rows and H columns by one first fractal weight matrix of H rows and Y columns as an example,
  • the relationship between X, Y, H and the read/write power consumption of the third buffer is described in detail below.
  • The matrix multiplication multiplies each column of the first fractal weight matrix by each row of the first fractal signal matrix. In a row-by-column multiplication, the first row of the first fractal signal matrix (containing H elements) is multiplied by the first column of the first fractal weight matrix (containing H elements), and the MAC unit performs a multiply-add operation over the H row elements and H column elements.
  • Specifically, the MAC unit first computes the first product of the first row element and the first column element and writes it into the third buffer; after computing the second product of the second row element and the second column element, it reads the first product from the third buffer, accumulates the first and second products, and writes the sum back into the third buffer; and so on, until the multiply-add result of the H row elements and H column elements is obtained.
  • Thus, the larger H is, the more times the MAC unit reads and writes the third buffer, and the greater the read/write power consumption of the third buffer; the smaller H is,
  • the fewer the reads and writes of the third buffer and the smaller its read/write power consumption, so H is positively correlated with the read/write power consumption of the third buffer.
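The accumulation traffic described above can be traced in plain Python (a sketch; `accumulate_dot` is an illustrative name, and the counters model the third-buffer accesses: H elements cost H writes and H - 1 reads of the running sum):

```python
def accumulate_dot(row, col):
    """Trace of the accumulation: write the first product, then for each
    further product read the running sum back, add, and write it again."""
    acc = row[0] * col[0]
    reads, writes = 0, 1              # first product written to the third buffer
    for x, y in zip(row[1:], col[1:]):
        partial = acc                 # read the running sum from the third buffer
        reads += 1
        acc = partial + x * y         # accumulate and write back
        writes += 1
    return acc, reads, writes

result, reads, writes = accumulate_dot([1, 2, 3], [4, 5, 6])   # H = 3
```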
  • Here the third buffer being used to store the product of one row element and one column element, or the accumulation of such products, is also exemplary and does not constitute a limitation on the embodiments of the present application.
  • In this embodiment, the third buffer is configured to store a matrix multiplication result, or the accumulation of at least two matrix multiplication results.
  • The first read/write power consumption of the first buffer herein includes the power consumption of writing the first fractal signal matrix to the first buffer and of reading the first fractal signal matrix from the first buffer;
  • the first read/write power consumption of the second buffer includes the power consumption of writing the first fractal weight matrix to the second buffer and of reading the first fractal weight matrix from the second buffer;
  • and the read/write power consumption of the third buffer includes the power consumption of writing to and reading from the third buffer a matrix multiplication result or the accumulation of at least two matrix multiplication results.
  • That is, after step (I) completes, the processor may store the matrix multiplication result C00_temp in the third buffer.
  • The processor can then read C00_temp from the third buffer, accumulate it with the second matrix multiplication result to obtain the matrix operation result C00, and store C00 in the third buffer.
  • In some cases the processor cannot operate on an entire first fractal signal matrix and first fractal weight matrix at a time, and further partitioning is needed to reach a granularity the processor can handle.
  • Taking the matrix partition shown in FIG. 8 as an example, if the granularity of A00, A01, A10, and A11, and of B00, B01, B10, and B11, is still too large after partitioning (for example, the processor cannot complete the operation of step (I) or step (II)), then, taking the calculation of step (I) as an example, the processor can further decompose it into the following formula (5).
  • The matrices A00(00), A00(01), A00(10), and A00(11) may be called the fractal matrices of A00, and B00(00), B00(01), B00(10), and B00(11) may be called the fractal matrices of B00; correspondingly, the matrix C00 can be composed of C00(00), C00(01), C00(10), and C00(11).
  • Performing a matrix multiplication on a first fractal signal matrix and its corresponding first fractal weight matrix to obtain a matrix multiplication result includes: partitioning the first fractal signal matrix to obtain a plurality of second fractal signal matrices of x rows and h columns, and partitioning the first fractal weight matrix to obtain a plurality of second fractal weight matrices of h rows and y columns, where each second fractal signal matrix and each second fractal weight matrix is a non-approximate square; and performing matrix multiplication and accumulation operations on the plurality of second fractal signal matrices and the plurality of second fractal weight matrices to obtain a plurality of matrix operation results.
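The further decomposition into second-level fractals can be sketched as a recursive block multiplication (a simplified model: this sketch halves square matrices of power-of-two size, whereas the embodiment allows rectangular x-row, h-column fractals; `limit` stands in for the granularity the processor can handle):

```python
def matmul(a, b):
    """Plain row-by-column matrix multiplication."""
    return [[sum(x * y for x, y in zip(r, c)) for c in zip(*b)] for r in a]

def fractal_matmul(a, b, limit=2):
    """Recursively partition a * b into 2x2 fractals until each fractal
    fits the granularity `limit`, then multiply and accumulate the blocks."""
    n = len(a)
    if n <= limit:
        return matmul(a, b)
    h = n // 2
    def split(m):
        return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
                [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])
    def add(p, q):
        return [[x + y for x, y in zip(rp, rq)] for rp, rq in zip(p, q)]
    a00, a01, a10, a11 = split(a)
    b00, b01, b10, b11 = split(b)
    c00 = add(fractal_matmul(a00, b00, limit), fractal_matmul(a01, b10, limit))
    c01 = add(fractal_matmul(a00, b01, limit), fractal_matmul(a01, b11, limit))
    c10 = add(fractal_matmul(a10, b00, limit), fractal_matmul(a11, b10, limit))
    c11 = add(fractal_matmul(a10, b01, limit), fractal_matmul(a11, b11, limit))
    return ([l + r for l, r in zip(c00, c01)] +
            [l + r for l, r in zip(c10, c11)])

A = [[(i * 4 + j) % 5 for j in range(4)] for i in range(4)]
B = [[(i + j) % 3 for j in range(4)] for i in range(4)]
C = fractal_matmul(A, B, limit=2)
```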
  • When the processor further includes a first register, a second register, and a third register, the processor may interact with the first buffer through the first register, with the second buffer through the second register, and with the third buffer through the third register.
  • The first register may be used to store a second fractal signal matrix, that is, the smallest fractal signal matrix, for example A00(00), A00(01), A00(10), or A00(11) in FIG. 8.
  • The second register is configured to store a second fractal weight matrix, that is, the smallest fractal weight matrix, for example B00(00), B00(01), B00(10), or B00(11) in FIG. 8.
  • The third register is configured to store a matrix multiplication result, or the accumulation of at least two matrix multiplication results, arising in the matrix multiplication of the plurality of second fractal signal matrices and the plurality of second fractal weight matrices, for example A00(00)B00(00) or A00(00)B00(01) in FIG. 8.
  • When the MAC unit performs a matrix multiplication, the first fractal signal matrix it reads is stored in the first buffer and the first fractal weight matrix it reads is stored in the second buffer; the MAC unit then reads a second fractal signal matrix of x rows and h columns from the first buffer through the first register, and a second fractal weight matrix of h rows and y columns from the second buffer through the second register.
  • Through the third register, the MAC unit stores in the third buffer a matrix multiplication result, or the accumulation of at least two matrix multiplication results, arising in the multiply-accumulate of the plurality of second fractal signal matrices and second fractal weight matrices, and/or reads such a result or accumulation from the third buffer.
  • x is positively correlated with the second read/write power consumption of the first buffer, y is positively correlated with the second read/write power consumption of the second buffer, and h is inversely correlated with the second read/write power consumption of both the first buffer and the second buffer.
  • x and y are inversely related to the read and write power consumption of the third buffer, respectively, and h is positively correlated with the read and write power consumption of the third buffer.
  • Here the second read/write power consumption of the first buffer includes the power consumption of writing the first fractal signal matrix to the first buffer and of reading the second fractal signal matrix from the first buffer through the first register;
  • the second read/write power consumption of the second buffer includes the power consumption of writing the first fractal weight matrix to the second buffer and of reading the second fractal weight matrix from the second buffer through the second register;
  • and the read/write power consumption of the third buffer includes the power consumption of writing to and reading from the third buffer, through the third register, a matrix multiplication result or the accumulation of at least two matrix multiplication results.
  • M and K are the number of rows and columns of the signal matrix, respectively
  • K and N are the number of rows and columns of the weight matrix, respectively
  • G1, G2, and G3 are related to M, N, and K.
  • The corresponding X can be determined according to the principle of minimum power consumption, with Y and H determined correspondingly, so as to acquire the plurality of first fractal signal matrices and the plurality of first fractal
  • weight matrices; when the matrix multiplication and accumulation operations are then performed, the processor's power consumption can be optimized. Because the power-consumption parameters of different devices differ, how to optimize X, Y, and H for power consumption can be determined by combining an understanding of the buffers' performance parameters with actual testing, depending on the application scenario and device selection; this embodiment does not expand further on this.
  • the number of rows and columns of the first fractal signal matrix and the number of rows and columns of the first fractal weight matrix may also be determined according to different buffer capacities, processor power consumption, bandwidth of different buffers, and the like. Therefore, when the output matrix is determined according to the plurality of first fractal signal matrices and the plurality of first fractal weight matrices, the capacity and bandwidth of different buffers can be fully utilized while reducing the power consumption of the processor as much as possible.
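One way to realize the selection described above is a brute-force search over candidate block sizes against a device-specific cost model (a sketch: the function name and all cost weights below are hypothetical stand-ins for measured per-buffer read/write energy, not values from the embodiment):

```python
def choose_block_sizes(M, K, N, cost):
    """Enumerate divisor triples (X, H, Y) of the matrix dimensions, keep only
    non-approximate-square fractals (|rows - cols| >= 2), and return the
    triple minimizing the device-specific cost model `cost(X, H, Y)`."""
    divisors = lambda n: [d for d in range(1, n + 1) if n % d == 0]
    best, best_cost = None, float('inf')
    for X in divisors(M):
        for H in divisors(K):
            for Y in divisors(N):
                if abs(X - H) < 2 or abs(H - Y) < 2:
                    continue  # each fractal must be a non-approximate square
                c = cost(X, H, Y)
                if c < best_cost:
                    best, best_cost = (X, H, Y), c
    return best

# Hypothetical weights: first/second buffer reads grow with X and Y,
# while larger H amortizes traffic (see the stated correlations).
cost = lambda X, H, Y: 2.0 * X + 2.0 * Y + 5.0 / H
X, H, Y = choose_block_sizes(8, 8, 8, cost)
```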
  • System power consumption bears a definite relationship to parameters such as the matrices' numbers of rows and columns, the number of reads and writes of each buffer, and each buffer's performance.
  • This embodiment therefore devises a method and apparatus that perform the calculation using fractal signal matrices and fractal weight matrices satisfying a non-approximate square, without strictly limiting the fractal matrices to squares, which improves design flexibility and adapts to the buffers' different read/write requirements.
  • The signal processing method provided by the embodiments of this application has mainly been introduced from the perspective of the device. It can be understood that, to implement the above functions, the device includes corresponding hardware structures and/or software modules for performing each function.
  • In combination with the network elements and algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is implemented in hardware or in computer software driving hardware depends on the specific application and the design constraints of the solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
  • The embodiments of this application may divide the signal processing device into function modules according to the foregoing method example.
  • each function module may be divided according to each function, or two or more functions may be integrated into one processing module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules. It should be noted that the division of the module in the embodiment of the present application is schematic, and is only a logical function division, and the actual implementation may have another division manner.
  • FIG. 12 is a schematic diagram showing a possible structure of the signal processing apparatus involved in the foregoing embodiments; the signal processing apparatus includes an obtaining unit 1201, a processing unit 1202, and an output unit 1203.
  • the obtaining unit 1201 is configured to support the signal processing device to perform step 701 in FIG. 7
  • The processing unit 1202 is configured to support the signal processing device to perform steps 702 and 703 in FIG. 7, and/or other processes of the techniques described herein.
  • a signal processing apparatus in the embodiment of the present application is described above from the perspective of a modular functional entity.
  • a signal processing apparatus in the embodiment of the present application is described below from the perspective of processor hardware processing.
  • the embodiment of the present application provides a signal processing apparatus.
  • the structure of the apparatus may be as shown in FIG. 2.
  • the signal processing apparatus includes a memory 201, a processor 202, a communication interface 203, and a bus 204.
  • the communication interface 203 may include an input interface 2031 and an output interface 2032.
  • The input interface 2031 is configured to obtain the signal matrix and/or the weight matrix; the input interface can switch between acquiring the signal matrix and acquiring the weight matrix. In some feasible embodiments, a single input interface may acquire the signal matrix and the weight matrix in a time-multiplexed manner; in some feasible embodiments, there may be two input interfaces that acquire the signal matrix and the weight matrix respectively, for example acquiring them simultaneously.
  • The processor 202 is configured to perform the functions of steps 702-703 of the signal processing method described above.
  • The processor may have a single-processor or multi-processor architecture, and may be a single-threaded or multi-threaded processor; in some feasible embodiments, the processor may be integrated in an application-specific integrated circuit, or may be a processor chip independent of such an integrated circuit.
  • The output interface 2032 is configured to output the signal processing result of the signal processing method.
  • The signal processing result may be output directly by the processor, or may be stored in the memory first and then output from the memory; in some feasible embodiments, there may be only one output interface, or there may be multiple output interfaces.
  • The signal processing result output by the output interface may be sent to the memory for storage, sent to the next signal processing device for further processing, sent to a display device for display, sent to a player terminal for playback, and so on.
  • the memory 201 can store the above-mentioned signal matrix, signal processing result, weight matrix, related instructions for configuring the processor, and the like.
  • The memory may be a floppy disk; a hard disk, such as a built-in hard disk or a removable hard disk; a magnetic disk; an optical disc; a magneto-optical disc, such as a CD-ROM or DVD-ROM; a non-volatile storage device, such as RAM, ROM, PROM, EPROM, EEPROM, or flash memory; or any other form of storage medium known in the art.
  • The embodiments of this application further provide a computer readable storage medium storing instructions that, when run on a device (for example, a single-chip microcomputer, a chip, or a computer),
  • cause the device to perform one or more of steps 701-704 of the signal processing method described above.
  • If implemented in the form of a software functional unit and sold or used as a standalone product, the constituent modules of the above signal processing device may be stored in the computer readable storage medium.
  • The embodiments of this application further provide a computer program product including instructions. The technical solution of the present application, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product:
  • the computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor therein to perform all or part of the steps of the methods of the embodiments of this application.

Abstract

The present application provides a signal processing method and apparatus, relating to the field of computer technology, for improving the flexibility of fractal matrices. The method is applied to a device comprising a processor and includes: obtaining a signal matrix and a weight matrix, both of which are two-dimensional matrices, where the number of columns of the signal matrix equals the number of rows of the weight matrix; partitioning the signal matrix into multiple first fractal signal matrices of X rows and H columns, and partitioning the weight matrix into multiple first fractal weight matrices of H rows and Y columns, where each first fractal signal matrix and each first fractal weight matrix is non-approximately square; and performing matrix multiplication and accumulation operations on the multiple first fractal signal matrices and the multiple first fractal weight matrices to obtain multiple matrix operation results, which are used to form a signal processing result.

Description

一种信号处理方法及装置
本申请要求于2017年12月29日提交中国专利局、申请号为201711481199.4、申请名称为“一种信号处理方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及计算机技术领域,尤其涉及一种信号处理方法及装置。
背景技术
神经网络(Neural Network,NN),是一种模仿动物神经网络行为特征进行信息处理的网络结构。该结构由大量的节点(或称神经元)相互联接构成,基于特定运算模型通过对输入信息进行学习和训练达到处理信息的目的。一个神经网络包括输入层、隐藏层及输出层,输入层负责接收输入信号,输出层负责输出神经网络的计算结果,隐藏层负责学习、训练等计算过程,是网络的记忆单元,隐藏层的记忆功能由权重矩阵来表征,通常每个神经元对应一个权重系数。
其中,卷积神经网络(Convolutional Neural Network,CNN)是一种多层的神经网络,每层由多个二维平面组成,而每个平面由多个独立神经元组成,每个平面的多个神经元共享权重,通过权重共享可以降低神经网络中的参数数目。目前,在卷积神经网络中,处理器进行卷积操作通常是将输入信号特征与权重的卷积,转换为信号矩阵与权重矩阵之间的矩阵乘运算。在具体矩阵乘运算时,通常是根据条件|rows-columns|≤1(也即是,|行数-列数|≤1,即矩阵的行数与列数的差值的绝对值小于或等于1),对信号矩阵和权重矩阵进行分形处理,得到多个近似于正方形的分形(Fractal)信号矩阵和分形权重矩阵,然后对多个分形信号矩阵和分形权重矩阵进行矩阵乘和累加运算。比如,如图1所示,假设C=AB,A为信号矩阵,B为权重矩阵,则进行矩阵乘运算时,由于处理器可能缺少对A和B这种大矩阵进行计算的能力或进行此类计算代价较大,可以将矩阵A根据条件划分为A00、A01、A10和A11,将矩阵B根据条件划分为B00、B01、B10和B11,相应的矩阵C可以由C00、C01、C10和C11四个矩阵块组成,矩阵C中每一矩阵块与分形信号矩阵和分形权重矩阵的关系可以如下公式所示。
C00=A00B00+A01B10
C01=A00B01+A01B11
C10=A10B00+A11B10
C11=A10B01+A11B11
上述方法中,在对矩阵C中的每个矩阵块进行计算时,可以通过数据复用的方式进行计算以减少功耗,例如,C00和C01的计算复用了数据A00,降低了读取数据A00的功耗开销。但是,根据条件|row-columns|≤1对信号矩阵和权重矩阵进行分形处理,得到的分形信号矩阵和分形权重矩阵的形状是固定的,其消耗的功耗也是固定的,设计灵活性不足。
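The 2x2 block decomposition and data-reuse pattern described above can be sketched in a few lines of Python. This is a hedged illustration only: the even 2x2 split and the matrix sizes are assumptions for demonstration, not values prescribed by the patent.

```python
import numpy as np

def blocked_matmul(A, B):
    """Compute C = A @ B via the four-block decomposition C00..C11."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % 2 == 0 and K % 2 == 0 and N % 2 == 0
    m, k, n = M // 2, K // 2, N // 2
    # split A and B into 2x2 grids of sub-blocks (fractal matrices)
    A00, A01 = A[:m, :k], A[:m, k:]
    A10, A11 = A[m:, :k], A[m:, k:]
    B00, B01 = B[:k, :n], B[:k, n:]
    B10, B11 = B[k:, :n], B[k:, n:]
    # each output block is an accumulation of two block products;
    # note A00 is reused for both C00 and C01 (the data-reuse point above)
    C00 = A00 @ B00 + A01 @ B10
    C01 = A00 @ B01 + A01 @ B11
    C10 = A10 @ B00 + A11 @ B10
    C11 = A10 @ B01 + A11 @ B11
    return np.block([[C00, C01], [C10, C11]])

A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0).reshape(4, 4)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

The four block formulas map one-to-one onto the equations for C00-C11 above; in hardware the same accumulation is done through a buffer rather than in one expression.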
发明内容
本发明的实施例提供一种信号处理方法及装置,用于提高分形矩阵的灵活性。
为达到上述目的,本发明的实施例采用如下技术方案:
第一方面,提供一种信号处理方法,应用于包含处理器的设备中,该方法包括:获取信号矩阵和权重矩阵,信号矩阵为二维矩阵且包括多个计算机可处理的待处理信号,权重矩阵为二维矩阵且包括多个权重系数,信号矩阵的列数与权重矩阵的行数相等;分块信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块权重矩阵,得到H行Y列的多个第一分形权重矩阵,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形;将多个第一分形信号矩阵和多个第一分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果,所述多个矩阵运算结果用于形成信号处理结果,每个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘结果由一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算得到。可选地,所述方法还包括:输出信号处理结果,该信号处理结果包括所述多个矩阵运算结果。
上述技术方案中,处理器在获取到信号矩阵和权重矩阵时,分块信号矩阵和权重矩阵,得到X行H列的多个第一分形信号矩阵和H行Y列的多个第一分形权重矩阵,由于每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形,因此,提高了分形矩阵的灵活性,以利于功耗优化设计。
在第一方面的一种可能的实现方式中,满足非近似正方形包括,矩阵的行数与列数的差值的绝对值大于或等于2,即第一分形信号矩阵和第一分形权重矩阵的行数与列数的差值的绝对值均大于或等于2。
在第一方面的一种可能的实现方式中,处理器包括第一缓存器和第二缓存器,分块信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块权重矩阵,得到H行Y列的多个第一分形权重矩阵,包括:通过第一缓存器分别多次从信号矩阵中读取X行H列的多个第一分形信号矩阵;通过第二缓存器分别多次从所述权重矩阵中读取H行Y列的多个第一分形权重矩阵。上述可能的技术方案中,处理器可以通过第一缓存器读取到非近似正方形的第一分形信号矩阵,通过第二缓存器读取到非近似正方形的第一分形权重矩阵,从而可以提高第一缓存器和第二缓存器读取的分形矩阵的灵活性。
在第一方面的一种可能的实现方式中,处理器还包括第三缓存器,该方法还包括:向第三缓存器中写入矩阵乘结果或至少两个矩阵乘结果的累加。
在第一方面的一种可能的实现方式中,对于一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算,得到一个矩阵乘结果,包括:分块第一分形信号矩阵,得到x行h列的多个第二分形信号矩阵,以及分块第一分形权重矩阵,得到h行y列的多个第二分形权重矩阵,每个第二分形信号矩阵和每个第二分形权重矩阵均满足非近似正方形;将多个第二分形信号矩阵和多个第二分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果。上述可能的技术方案中,当处理器一次无法计算一个第一分形信号矩阵与一个第一分形权重矩阵的矩阵乘运算时,还可以进一步将其分为多个较小的非近似正方形的第二分形信号矩阵和第二分形权重矩阵,通过多个第二分形信号矩阵和多个第二分形权重矩阵进行矩阵乘和累加运算,从而可以进一步提高分形矩阵的灵活性。
第二方面,提供一种信号处理装置,该装置包括:获取单元,用于获取信号矩阵和权重矩阵,信号矩阵为二维矩阵且包括多个计算机可处理的待处理信号,权重矩阵为二维矩阵且包括多个权重系数,信号矩阵的列数与权重矩阵的行数相等;处理单元,用于分块信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块权重矩阵,得到H行Y列的多个第一分形权重矩阵,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形,以及将多个第一分形信号矩阵和多个第一分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果,所述多个矩阵运算结果用于形成信号处理结果,每个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘结果由一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算得到。可选地,该装置还包括输出单元,用于输出信号处理结果,该信号处理结果包括所述多个矩阵运算结果。
在第二方面的一种可能的实现方式中,满足非近似正方形包括,矩阵的行数与列数的差值的绝对值大于或等于2。
在第二方面的一种可能的实现方式中,处理单元包括第一缓存器和第二缓存器,处理单元具体用于:通过第一缓存器分别多次从信号矩阵中读取X行H列的多个第一分形信号矩阵;通过第二缓存器分别多次从所述权重矩阵中读取H行Y列的多个第一分形权重矩阵。
在第二方面的一种可能的实现方式中,处理单元还包括第三缓存器,处理单元还用于:向第三缓存器中写入矩阵乘运算结果或至少两个矩阵乘运算结果的累加。
在第二方面的一种可能的实现方式中,对于一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算,处理单元还用于:分块第一分形信号矩阵,得到x行h列的多个第二分形信号矩阵,以及分块第一分形权重矩阵,得到h行y列的多个第二分形权重矩阵,每个第二分形信号矩阵和每个第二分形权重矩阵均满足非近似正方形;将多个第二分形信号矩阵和多个第二分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果。
第三方面,提供一种信号处理装置,该装置包括:输入接口,用于获取信号矩阵和权重矩阵,信号矩阵为二维矩阵,权重矩阵为二维矩阵且包括多个权重系数,信号矩阵的列数与权重矩阵的行数相等;处理器,被配置为可处理如下操作:分块信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块权重矩阵,得到H行Y列的多个第一分形权重矩阵,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形;将多个第一分形信号矩阵和多个第一分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果,所述多个矩阵运算结果用于形成信号处理结果,每个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘结果由一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算得到。可选地,该装置还包括输出接口,用于输出信号处理结果,信号处理结果包括所述多个矩阵运算结果。
在第三方面的一种可能的实现方式中,满足非近似正方形包括:矩阵的行数与列数的差值的绝对值大于或等于2。
在第三方面的一种可能的实现方式中,处理器包括第一缓存器和第二缓存器,处理器还执行以下操作:通过第一缓存器分别多次从信号矩阵中读取X行H列的多个第一分形信号矩阵;通过第二缓存器分别多次从权重矩阵中读取H行Y列的多个第一分形权重矩阵。
在第三方面的一种可能的实现方式中,处理器还包括第三缓存器,处理器还执行以下操作:向第三缓存器中写入矩阵乘结果或至少两个矩阵乘结果的累加。
在第三方面的一种可能的实现方式中,对于一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算,处理器还执行以下操作:分块第一分形信号矩阵,得到x行h列的多个第二分形信号矩阵,以及分块第一分形权重矩阵,得到h行y列的多个第二分形权重矩阵,每个第二分形信号矩阵和每个第二分形权重矩阵均满足非近似正方形;将多个第二分形信号矩阵和多个第二分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果。
本申请的又一方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得该计算机执行上述第一方面或第一方面的任一种可能的实现方式所提供的信号处理方法。
本申请的又一方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得该计算机执行上述第一方面或第一方面的任一种可能的实现方式所提供的信号处理方法。
本申请的又一方面,提供了一种处理器,该处理器用于:获取信号矩阵和权重矩阵,所述信号矩阵为二维矩阵且包括多个计算机可处理的待处理信号,所述权重矩阵为二维矩阵且包括多个权重系数,所述信号矩阵的列数与所述权重矩阵的行数相等;分块所述信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块所述权重矩阵,得到H行Y列的多个第一分形权重矩阵,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形;将所述多个第一分形信号矩阵和所述多个第一分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果,所述多个矩阵运算结果用于形成信号处理结果,每个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘结果由一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算得到。
在一种可能的实现方式中,所述满足所述非近似正方形包括:矩阵的行数与列数的差值的绝对值大于或等于2。
在一种可能的实现方式中,所述处理器包括第一缓存器和第二缓存器,所述处理器还执行以下操作:通过所述第一缓存器分别多次从所述信号矩阵中读取X行H列的多个第一分形信号矩阵;通过所述第二缓存器分别多次从所述权重矩阵中读取H行Y列的多个第一分形权重矩阵。
在一种可能的实现方式中,所述处理器还包括第三缓存器,所述处理器还执行以下操作:向所述第三缓存器中写入所述矩阵乘结果或至少两个所述矩阵乘结果的累加。
在一种可能的实现方式中,所述处理器还执行以下操作:分块所述第一分形信号矩阵,得到x行h列的多个第二分形信号矩阵,以及分块所述第一分形权重矩阵,得到h行y列的多个第二分形权重矩阵,每个第二分形信号矩阵和每个第二分形权重矩阵均满足所述非近似正方形;将所述多个第二分形信号矩阵和所述多个第二分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果。
在一种可能的实现方式中,所述处理器包括用于执行之前所述计算处理的计算单元。可选地,所述计算单元包括乘累加单元。所述乘累加单元是用于执行乘累加运算的硬件。
可以理解地,上述提供的任一种信息处理方法的装置、计算机存储介质或者计算机程序产品均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
附图说明
图1为一种矩阵分块的示意图;
图2为本发明实施例提供的一种设备的结构示意图;
图3为本发明实施例提供的一种神经网络的结构示意图;
图4为本发明实施例提供的一种全连接神经网络的结构示意图;
图5为本发明实施例提供的一种卷积神经网络的结构示意图;
图6为本发明实施例提供的一种卷积操作的示意图;
图7为本发明实施例提供的一种信号处理方法的流程示意图;
图8为本发明实施例提供的一种矩阵分块的示意图;
图9为本发明实施例提供的一种第一分形信号矩阵的示意图;
图10为本发明实施例提供的一种处理器的结构示意图;
图11为本发明实施例提供的另一种处理器的结构示意图;
图12为本发明实施例提供的一种信号处理装置的结构示意图。
具体实施方式
图2为本申请实施例提供的一种设备的结构示意图,参见图2,该设备可以包括存储器201、处理器202、通信接口203和总线204。其中,存储器201、处理器202以及通信接口203通过总线204相互连接。存储器201可用于存储数据、软件程序以及模块,主要包括存储程序区和存储数据区,存储程序区可存储操作***、至少一个功能所需的应用程序等,存储数据区可存储该设备的使用时所创建的数据等。处理器202用于对该设备的动作进行控制管理,比如通过运行或执行存储在存储器201内的软件程序和/或模块,以及调用存储在存储器201内的数据,执行该设备的各种功能和处理数据。通信接口203用于支持该设备进行通信。
其中,处理器202可以包括中央处理器单元,通用处理器,数字信号处理器,专用集成电路,现场可编程门阵列或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。所述处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,数字信号处理器和微处理器的组合等等。总线204可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线,或者扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示,图2中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
如图3所示,是一种神经网络的结构示意图,该神经网络300具有N个处理层,N≥3且N取自然数,该神经网络的第一层为输入层301,负责接收输入信号,该神经网络的最后一层为输出层303,输出神经网络的处理结果,除去第一层和最后一层的其他层为中间层304,这些中间层共同组成隐藏层302,隐藏层中的每一层中间层既可以接收输入信号,也可以输出信号,隐藏层负责输入信号的处理过程。每一层代表了信号处理的一个逻辑级别,通过多个层,数据信号可经过多级逻辑的处理。
为便于理解,下面对本申请实施例中神经网络的处理原理进行描述,神经网络的处理通常是非线性函数f(x_i),如f(x_i)=max(0,x_i),在一些可行的实施例中,该处理函数可以是修正线性单元(Rectified Linear Units,ReLU)、双曲正切函数(tanh)或S型函数(sigmoid)等。假设(x_1,x_2,x_3)是一个一维信号矩阵,(h_1,h_2,h_3)是输出信号矩阵,W_ij表示输入x_j与输出h_i之间的权重系数,权重系数构成的矩阵为权重矩阵,则该一维信号矩阵与输出信号矩阵对应的权重矩阵W如式(1)所示:

W = \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \end{pmatrix}    (1)
输入信号与输出信号的关系如式(2)所示,其中b_i为神经网络处理函数的偏置值,该偏置值对神经网络的输入进行调整从而得到理想的输出结果。

h_i = f\left( \sum_{j=1}^{3} W_{ij} x_j + b_i \right), \quad i = 1, 2, 3    (2)
在一些可行的实施例中该神经网络的输入信号可以是语音信号、文本信号、图像信号、温度信号等各种形式的信号,该语音信号可以是录音设备录制的语音信号、移动手机或固定电话在通话过程中接收的语音信号、以及收音机接收的电台发送的语音信号等,文本信号可以是TXT文本信号、Word文本信号、以及PDF文本信号等,图像信号可以是相机拍摄的风景信号、监控设备捕捉的社区环境的图像信号以及门禁***获取的人脸的面部信号等,该神经网络的输入信号包括其他各种计算机可处理的工程信号,在此不再一一列举。该神经网络的隐藏层302进行的处理可以是去除语音信号中混杂的噪音信号从而增强语音信号、对文本信号中的特定内容进行理解、以及对人脸的面部图像信号进行识别等处理。
神经网络的每个层可以包括多个节点,也可以称为神经元。全连接神经网络是一种相邻层之间各神经元全连接的神经网络,即前一层中的全部神经元与后一层中的每个神经元都连接。示例性的,图4是一种包含三层的全连接神经网络的结构示意图,层1和层2均包括四个神经元,层3包括一个神经元。图4中“+1”表示偏置神经元,用于对神经网络中每一层的输入进行调整。由于全连接网络的相邻层中神经元是全连接的,当全连接神经网络的中间层较多时,则越靠后的处理层中的信号矩阵和权重矩阵的维度会很庞大,从而导致神经网络的网络尺寸过于庞大。
卷积神经网络可以采用较小的参数模板在输入信号空间域上滑动滤波,从而解决全连接神经网络中网络尺寸过于庞大的问题。卷积神经网络与普通神经网络的区别在于,卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。在卷积神经网络的卷积层中,一个神经元只与部分邻层神经元连接。在卷积神经网络的一个卷积层中,通常包含若干个特征平面,每个特征平面由一些矩形排列的神经元组成,同一特征平面的神经元共享权值,这里共享的权值就是卷积核。卷 积核一般以随机小数矩阵的形式初始化,在网络的训练过程中卷积核将学习得到合理的权值。卷积核带来的直接好处是减少网络各层之间的连接。子采样也叫做池化,子采样可以看作一种特殊的卷积过程,卷积和子采样大大简化了模型复杂度,减少了模型的参数。如图5所示,卷积神经网络由三部分构成,第一部分是输入层,第二部分由多个卷积层和多个池化层的组合组成,第三部分是输出层,输出层可以由一个全连接的多层感知机分类器构成。
卷积神经网络中的卷积层可用于对输入信号阵列和权重阵列进行卷积操作。具体的,这里以一维输入信号为例,假设输入信号为f(u),u=0~N-1,卷积核为h(v),v=0~n-1,n≤N,则卷积运算可以通过以下公式(3)来描述。
y(u) = \sum_{v=0}^{n-1} f(u-v)\, h(v)    (3)
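As a minimal numeric illustration of the 1-D convolution in formula (3) above (a sketch only; the signal and kernel values are arbitrary assumptions):

```python
import numpy as np

# 1-D convolution of an example signal f (N = 4) with a kernel h (n = 2);
# np.convolve computes y(u) = sum_v f(u - v) * h(v) over the full overlap.
f = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 1.0])
y = np.convolve(f, h)   # full convolution, length N + n - 1 = 5
assert np.allclose(y, [1.0, 3.0, 5.0, 7.0, 4.0])
```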
卷积神经网络可以广泛应用于语音识别、人脸识别、通用物体识别、运动分析、图像处理等。示例性的,以输入信号为二维矩阵为例,如图6所示,假设该图像在某一卷积层中对应的输入特征包括3个三行三列的信号矩阵,卷积核包括6个二行二列的权重矩阵。图6中示出了卷积神经网络中进行卷积操作的两种具体操作方式,一种是传统卷积操作,另一种是矩阵变换后的卷积操作。其中,传统卷积操作是将每个信号矩阵与其对应的权重矩阵进行矩阵乘运算,并将对应的矩阵乘运算的结果进行累加,得到两个输出信号矩阵,即输出特征。另一种矩阵变换后的卷积操作,对不同的信号矩阵进行了转换,得到一个同时包括3个信号矩阵且矩阵维度较大的输入特征矩阵;同理,对6个权重矩阵也进行相应的转换操作,得到一个同时包括6个权重矩阵且矩阵维度较大的核矩阵;之后,通过变换得到的输入特征矩阵和核矩阵进行矩阵乘运算,得到输出特征矩阵。
通过对信号矩阵和权重矩阵进行矩阵变换,可以减小矩阵乘的操作次数,进而减小读取信号矩阵和权重矩阵的开销。但是,变换后的矩阵乘运算需要的计算开销较大,因此,需要通过矩阵分块将其转换为较小的分形矩阵,并通过分形矩阵相乘得到相应的结果,即将一个大的矩阵相乘拆分为多个分形矩阵的相乘和累加。
为了便于理解,下面对本申请实施例中一种具体的信号处理方法进行描述,该信号处理方法可以在神经网络的隐藏层中的任一个中间层中进行处理。可选的,该神经网络可以是全连接神经网络,该中间层也可以称为全连接层;或者,该神经网络也可以是卷积神经网络,该中间层中进行的处理具体可以是在卷积神经网络中的卷积层中处理。
图7为本申请实施例提供的一种信号处理方法的流程示意图,该方法的执行主体可以是设备,具体可以是设备中具有计算功能的单元,比如神经网络处理器等,该方法包括以下几个步骤。
步骤701:获取信号矩阵和权重矩阵,信号矩阵的列数与权重矩阵的行数相等。
其中,信号矩阵可以来自神经网络的输入层或者信号处理所在中间层中的上一层中间层,该输入信号可以是语音信号、文本信号、图像信号以及温度信号等各种可以被采集并且被处理的信号,该矩阵可以是未进行矩阵转换的矩阵,也可以是经过矩阵转换后的矩阵,该信号矩阵可以是M行K列的二维矩阵,且矩阵中包括多个计算机可 处理的待处理信号,即每个元素对应一个信号。当该信号矩阵是经过转换后的矩阵时,该信号矩阵转换前的矩阵可以是一维列向量、一维行向量、二维矩阵(比如灰度图像)、以及三维矩阵(比如RGB彩色图像)等,本申请实施例对此不作具体限定。
另外,权重矩阵由一个个权重系数构成,该权重矩阵可以是由神经网络定义的,权重系数作用于输入信号,权重系数大对应的输入信号在神经网络学习训练的过程中会被加强,权重系数小对应的输入信号在学习训练的过程中会被减弱。该权重矩阵可以是未进行矩阵转换的权重矩阵,也可以是经过矩阵转换后的权重矩阵,且为K行N列的二维权重矩阵。
步骤702:分块信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块权重矩阵,得到H行Y列的多个第一分形权重矩阵,多个第一分形信号矩阵与多个第一分形权重矩阵存在对应关系,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形。
其中,由于原始的信号矩阵和权重矩阵通常都具有较大的维度,处理器无法直接对大维度的矩阵进行运算,因此需要对信号矩阵和权重矩阵分别进行分块,分块一个矩阵是指将矩阵分成多个子块,每个子块可以称为一个分形矩阵。分块得到的多个第一分形信号矩阵的个数与多个第一分形权重矩阵的个数相等,且多个第一分形信号矩阵与多个第一分形权重矩阵之间存在对应关系,该对应关系可以是一对多的关系,也可以是多对一的关系,或者是多对多的关系,即一个第一分形信号矩阵可以对应多个第一分形权重矩阵,或者多个第一分形信号矩阵对应一个第一分形权重矩阵、或者多个第一分形信号矩阵对应多个第一分形权重矩阵。
另外,第一分形信号矩阵的列数和第一分形权重矩阵的行数均为H,即一个第一分形信号矩阵与其对应的第一分形权重矩阵满足矩阵乘规则,该矩阵乘规则是指参加矩阵乘的第一个矩阵的列数等于参加矩阵乘的第二个矩阵的行数。X、H和Y与信号矩阵的行数和列数、以及权重矩阵的行数和列数有关,且第一分形信号矩阵和第一分形权重矩阵均满足非近似正方形。
如图8所示,为一种矩阵分块的示意图。假设信号矩阵为A、权重矩阵为B、矩阵C=AB。示例性的,信号矩阵A分块得到4个第一分形信号矩阵,分别表示为A00、A01、A10和A11,权重矩阵B分块得到4个第一分形权重矩阵,分别表示为B00、B01、B10和B11。以A00、A01、A10和A11为例,则A00和A10分别对应的第一分形权重矩阵包括B00和B01,A01和A11分别对应的第一分形权重矩阵包括B10和B11。矩阵C可以由4个矩阵C00、C01、C10和C11组成,矩阵C中的每个组成矩阵与第一分形信号矩阵和第一分形权重矩阵的关系可以如下公式(4)所示。
C = \begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix} = \begin{pmatrix} A_{00}B_{00}+A_{01}B_{10} & A_{00}B_{01}+A_{01}B_{11} \\ A_{10}B_{00}+A_{11}B_{10} & A_{10}B_{01}+A_{11}B_{11} \end{pmatrix}    (4)
其中,公式(4)中矩阵C的每个矩阵块的计算可以分两步执行,比如以C00为例,可以按照如下步骤(I)-(II)来执行,通过重复利用数据C00_temp可以减少数据的读写量,降低处理器对带宽的需求,同时节省内存的读写功耗。
C00_temp=A00B00        (I)
C00_temp=C00_temp+A01B10      (II)
或者,公式(4)中矩阵C的计算按照如下步骤(i)-(ii)的示例执行时,处理器只需获取一次A00,可以通过复用数据A00的方式,节省处理器对内存的读写功耗。
C00_temp=A00B00         (i)
C01_temp=A00B01          (ii)
本申请实施例中,当处理器计算信号矩阵与权重矩阵的矩阵乘运算时,通过将信号矩阵分块为X行H列的多个第一分形信号矩阵,以及将权重矩阵分块为H行Y列的多个第一分形权重矩阵,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形,从而提高了分形矩阵的灵活性,进而基于多个第一分形信号矩阵和多个第一分形权重矩阵进行矩阵乘和累加运算时,可以根据不同数据的读写功耗,实现处理器计算信号矩阵与权重矩阵的矩阵乘运算的最优设计。
可选的,满足非近似正方形包括矩阵的行数与列数的差值的绝对值大于或等于2。即第一分形信号矩阵的行数X和列数H满足|X-H|≥2,第一分形权重矩阵的行数H和列数Y满足|H-Y|≥2,即第一分形信号矩阵和第一分形权重矩阵均可以为行数与列数之间的差值大于或等于2的长方形矩阵,即不满足近似正方形。示例性的,假设信号矩阵A为M×K矩阵、权重矩阵B为K×N矩阵,X、H和Y与M、K和N有关,第一分形信号矩阵的行数X和列数H可以如图9所示。
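The non-approximately-square condition |X - H| >= 2 can be checked mechanically when tiling a matrix into fractal blocks. The sketch below is an assumption-laden illustration: X = 2, H = 8 and the zero-filled signal matrix are example values, not figures taken from the patent.

```python
import numpy as np

def partition(mat, rows, cols):
    """Tile mat into non-overlapping (rows x cols) fractal blocks."""
    M, K = mat.shape
    assert M % rows == 0 and K % cols == 0, "dims must divide evenly here"
    return [mat[i:i + rows, j:j + cols]
            for i in range(0, M, rows)
            for j in range(0, K, cols)]

X, H = 2, 8
assert abs(X - H) >= 2               # the non-approximately-square condition
signal = np.zeros((4, 16))           # assumed M x K signal matrix
fractals = partition(signal, X, H)
assert len(fractals) == (4 // X) * (16 // H)   # 4 fractal blocks
assert fractals[0].shape == (X, H)
```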
步骤703:将多个第一分形信号矩阵和多个第一分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果,所述多个矩阵运算结果用于形成信号处理结果,每个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘结果由一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算得到。
其中,在上面的描述中已经提到过,多个第一分形信号矩阵的个数与多个第一分形权重矩阵的个数可以相等,也可以不相等。多个第一分形信号矩阵与多个第一分形权重矩阵之间可以存在对应关系,第一分形信号矩阵与第一分形权重矩阵之间满足矩阵乘规则,一个第一分形矩阵与对应于该第一分形矩阵的第一权重矩阵进行矩阵乘运算得到一个矩阵乘运算结果。因此,根据多个第一分形信号矩阵与多个第一分形权重矩阵的对应关系,将多个第一分形信号矩阵和多个第一分形权重矩阵进行矩阵乘和累加运算。以上计算过程可以得到包括多个矩阵乘运算结果的输出矩阵,一个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘运算结果可以包括多个计算机可处理的输出信号。
需要说明的是,如果将每个第一分形信号矩阵和每个第一分形权重矩阵各作为一个元素,则多个第一分形信号矩阵和多个第一分形权重矩阵的矩阵乘和累加运算,与包含多个元素的两个矩阵之间的相乘运算的计算方式类似。
为便于理解,这里以上述公式(4)为例进行说明,可以将矩阵C称为输出矩阵,C00、C01、C10和C11称为矩阵运算结果,输出矩阵C包括四个矩阵运算结果。以C00为例,A00与B00的乘积为一个矩阵乘结果,A01与B10的乘积也为一个矩阵乘结果,这两个矩阵乘结果在输出矩阵C中都对应C00的位置,则将这两个矩阵乘结果的累加称为一个矩阵运算结果。
步骤704:输出信号处理结果,该信号处理结果包括多个所述矩阵运算结果。
当获取多个矩阵乘运算结果之后,处理器还可以输出信号处理结果,该信号处理结果包括多个矩阵运算结果。该多个矩阵运算结果组成的输出矩阵可以是二维矩阵(比如,灰度图像),该输出矩阵对应的输出信号可以是与输入信号对应的语音信号、文本信号、图像信号以及温度信号等各种可以被处理或者可以被播放、显示的信号。可选的,该信号处理结果可以去往信号处理所在中间层的下一层中间层或者神经网络的输出层。
进一步的,如图10所示,处理器可以包括乘累加(Multiply-Accumulator,MAC)单元、第一缓存器、第二缓存器和第三缓存器,处理器中的MAC单元可以与第一缓存器、第二缓存器和第三缓存器直接进行交互,处理器还可以包括第四缓存器,对于第一缓存器和第二缓存器相连,MAC单元可以通过第一缓存器和第二缓存器与第三缓存器进行交互。其中,MAC单元用于执行具体的乘加运算,第四缓存器可用于存储信号矩阵和权重矩阵,第一缓存器可用于存储信号矩阵的第一分形信号矩阵,第二缓存器可用于存储权重矩阵的第一分形权重矩阵,第三缓存器用于存储矩阵乘结果、或者至少两个矩阵乘结果的累加,至少两个矩阵乘结果的累加可以是一个矩阵运算结果。
例如,以上处理器中的各个单元可以是电路硬件,包括但不限于晶体管、逻辑门、或基本运算单元等的一个或多个。再例如,信号矩阵和权重矩阵可以是来自处理器之前计算所生成的矩阵,也可以来自处理器之外的其他设备,如硬件加速器或其他处理器等。本实施例的处理器用于获取信号矩阵和权重矩阵并依照之前实施例的方法执行计算。图10中的处理器的具体运算过程可参照之前的方法实施例。
具体的,分块信号矩阵,得到X行H列的多个第一分形信号矩阵,包括:MAC单元通过第一缓存器分别多次从信号矩阵中读取X行H列的第一分形信号矩阵,以得到X行H列的多个第一分形信号矩阵。其中,处理器可以从第四缓存器中读取X行H列的第一分形信号矩阵,并将X行H列的第一分形信号矩阵存储在第一缓存器中。第一缓存器的容量V1可以是固定的,且可以和X与H的乘积相等,即V1=X×H,第一分形信号矩阵可以填充满第一缓存器。
具体的,分块权重矩阵,得到H行Y列的多个第一分形权重矩阵,包括:MAC单元通过第二缓存器分别多次从权重矩阵中读取H行Y列的第一分形权重矩阵,以得到H行Y列的多个第一分形权重矩阵。其中,处理器可以从第四缓存器中读取H行Y列的第一分形权重矩阵,并将H行Y列的第一分形权重矩阵存储在第二缓存器中。第二缓存器的容量V2可以是固定的,且可以和H与Y的乘积相等,即V2=H×Y,第一分形权重矩阵可以填充满第二缓存器。
其中,X与第一缓存器的第一读写功耗正相关,Y与第二缓存器的第一读写功耗正相关,H分别与第一缓存器和第二缓存器的第一读写功耗反相关。X与第一缓存器的第一读写功耗正相关是指,当X越大时,第一缓存器的第一读写功耗越大,当X越小时,第一缓存器的第一读写功耗越小,比如,X与第一缓存器的第一读写功耗成正 比。Y与第二缓存器的第一读写功耗正相关是指,当Y越大时,第二缓存器的第一读写功耗越大,当Y越小时,第二缓存器的第一读写功耗越小,比如,Y与第二缓存器的第一读写功耗成正比。H与第一缓存器和第二缓存器的第一读写功耗反相关是指,当H越大时,第一缓存器和第二缓存器的第一读写功耗越小,当H越小时,第一缓存器和第二缓存器的第一读写功耗越大,比如,H与第一缓存器和第二缓存器的第一读写功耗成反比。
为便于理解,这里以一个第一分形信号矩阵和一个第一分形权重矩阵的矩阵乘为例,对MAC单元读取X行H列的第一分形信号矩阵和读取H行Y列的第一分形权重矩阵的过程中,X、Y和H与第一缓存器和第二缓存器的读写功耗的关系进行详细说明。
当MAC单元读取第一分形信号矩阵时,MAC单元需要先从第四缓存器存储的信号矩阵中通过第一缓存器读取一个X行H列的第一分形信号矩阵,即从第四缓存器读取的第一分形信号矩阵写入第一缓存器中,再从第一缓存器中读取第一分形信号矩阵。同理,当MAC单元读取第一分形权重矩阵时,MAC单元需要先从第四缓存器存储的权重矩阵中通过第二缓存器读取一个H行Y列的第一分形权重矩阵,即从第四缓存器读取的第一分形权重矩阵写入第二缓存器中,再从第二缓存器中读取第一分形权重矩阵。当MAC单元在进行矩阵乘运算时,由于矩阵乘运算是用第一分形信号矩阵中的每一行分别乘以第一分形权重矩阵中的每一列,所以,MAC单元需要通过X次读操作从第一缓存器中读取第一分形信号矩阵的X行,以及通过Y次读操作从第二缓存器中读取第一分形权重矩阵的Y列。由此可知,当X越大时,第一缓存器的读操作次数越大,进而第一缓存器的第一读写功耗越大,当X越小时,第一缓存器的读操作次数越小,进而第一缓存器的第一读写功耗越小,所以X与第一缓存器的第一读写功耗正相关。同理,当Y越大时,第二缓存器的读操作次数越大,进而第二缓存器的第一读写功耗越大,当Y越小时,第二缓存器的读操作次数越小,进而第二缓存器的第一读写功耗越小,所以Y与第二缓存器的第一读写功耗正相关。
由于第一缓存器和第二缓存器的容量通常是固定的,假设第一缓存器的容量V1=X×H固定,则当X越大时H越小,当X越小时H越大,所以H与第一缓存器的第一读写功耗反相关。假设第二缓存器的容量V2=H×Y固定,则当Y越大时H越小,当Y越小时H越大,所以H与第二缓存器的第一读写功耗反相关。可选的,第一缓存器的容量和第二缓存器的容量可以相等,即X×H=H×Y,则X和Y相等。
具体的,当MAC单元进行矩阵乘运算,第三缓存器用于存储矩阵乘结果、或者至少两个矩阵乘结果的累加时,该方法还可以包括:向第三缓存器中写入矩阵乘结果或至少两个矩阵乘结果的累加;和/或,从第三缓存器中读取矩阵乘结果或至少两个矩阵乘结果的累加。
其中,X和Y分别与第三缓存器的读写功耗反相关,H与第三缓存器的读写功耗正相关。X和Y分别与第三缓存器的读写功耗反相关,是指当X和Y越大时,第三缓存器的读写功耗越小,当X和Y越小时,第三缓存器读写功耗越大,比如,X和Y分别与第三缓存器的读写功耗成反比。
为便于理解,这里以一个第一分形信号矩阵和一个第一分形权重矩阵的矩阵乘为 例,对MAC单元进行X行H列的第一分形信号矩阵与H行Y列的第一分形权重矩阵的矩阵乘运算过程中,X、Y和H与第三缓存器的读写功耗的关系进行详细说明。
矩阵乘运算是用第一分形信号矩阵中的每一行分别乘以第一分形权重矩阵中的每一列,在进行行列相乘时,以第一分形信号矩阵中的第1行(第1行中包括H个行元素)与第一分形权重矩阵中的第1列(第1列中包括H个列元素)相乘为例,则当MAC单元进行H个行元素与H个列元素的乘加运算时,MAC单元先计算第一个行元素与第一个列元素的第一乘积之后,将第一乘积写入第三缓存器中,再计算第二个行元素与第二个列元素的第二乘积,之后从第三缓存器中读取第一乘积,将第一乘积和第二乘积进行累加后写入第三缓存器中,以此类推,直到计算得到H个行元素与H个列元素的乘加运算的结果。
由此可知,当H越大时,则MAC单元对第三缓存器的读写次数越大,进而第三缓存器的读写功耗越大,当H越小时,则MAC单元对第三缓存器的读写次数越小,进而第三缓存器的读写功耗越小,所以H与第三缓存器的读写功耗正相关。
由于第一缓存器和第二缓存器的容量通常是固定的,即V1=X×H和V2=H×Y是固定的,因此,当H越大时,则X和Y越小,当H越小时,则X和Y越大。因此,X和Y分别与第三缓存器的读写功耗反相关,H与第三缓存器的读写功耗正相关。
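The read/write correlations discussed above can be captured in a toy counting model. This is a schematic sketch under stated assumptions (one read per row/column, H writes plus H-1 read-backs per accumulated output element); the patent does not give these exact counts.

```python
def access_counts(X, H, Y):
    """Toy access counts for one fractal multiply (X x H) @ (H x Y)."""
    buf1_reads = X                    # one read per row of the signal fractal
    buf2_reads = Y                    # one read per column of the weight fractal
    buf3_ops_per_output = 2 * H - 1   # H partial-sum writes + (H-1) read-backs
    return buf1_reads, buf2_reads, buf3_ops_per_output

# With capacities V1 = X*H and V2 = H*Y fixed, a larger H forces smaller
# X and Y (fewer buf1/buf2 reads) but more buf3 traffic per output element.
b1, b2, b3 = access_counts(2, 8, 4)
assert (b1, b2, b3) == (2, 4, 15)
```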
需要说明的是,上述是以一个第一分形信号矩阵和一个第一分形权重矩阵的矩阵乘为例进行说明,此时第三缓存器中存储的是一个行元素与一个列元素的乘积或者乘积的累加,这也是示例性的,并不对本申请实施例构成限定。当MAC单元将多个第一分形信号矩阵和多个第一分形权重矩阵进行矩阵乘和累加运算的过程中,第三缓存器用于存储一个矩阵乘结果、或者至少两个矩阵乘结果的累加。
另外,这里的第一缓存器的第一读写功耗包括向第一缓存器写入第一分形信号矩阵的功耗,以及从第一缓存器中读取第一分形信号矩阵的功耗;第二缓存器的第一读写功耗包括向第二缓存器写入第一分形权重矩阵的功耗,以及通过从第二缓存器中读取第一分形权重矩阵的功耗;第三缓存器的读写功耗包括向第三缓存器写入和读取矩阵乘运算结果或者至少两个矩阵乘运算结果的累加的功耗。
比如,以图8所示的矩阵分块为例,当处理器在按照上述步骤(I)和步骤(II)确定C00时,处理器可以在步骤(I)执行完成后将矩阵乘结果C00_temp存储在第三缓存器中,当完成A01与B10的矩阵乘运算得到第二个矩阵乘结果之后,处理器可以从第三缓存器中读取C00_temp,并将其与第二个矩阵乘结果进行累加,以得到矩阵运算结果C00,将C00存储在第三缓存器中。
进一步的,如果获取的第一分形信号矩阵和第一分形权重矩阵的维度仍然较大,处理器无法一次完成对一个第一分形信号矩阵和一个第一分形权重矩阵的运算,则还可以对其进一步分块处理,以得到处理器可以处理的粒度。
比如,以图8所示的矩阵分块为例,如果分块后的A00、A01、A10和A11,以及B00、B01、B10和B11的粒度仍然较大,比如,处理器无法完成上述步骤(I)或步骤(II)的运算,则以步骤(I)的计算为例,处理器可以进一步将其分解为如下公式(5)所示。
C_{00} = \begin{pmatrix} A_{00(00)}B_{00(00)}+A_{00(01)}B_{00(10)} & A_{00(00)}B_{00(01)}+A_{00(01)}B_{00(11)} \\ A_{00(10)}B_{00(00)}+A_{00(11)}B_{00(10)} & A_{00(10)}B_{00(01)}+A_{00(11)}B_{00(11)} \end{pmatrix} = \begin{pmatrix} C_{00(00)} & C_{00(01)} \\ C_{00(10)} & C_{00(11)} \end{pmatrix}    (5)
其中,矩阵A00(00)、A00(01)、A00(10)和A00(11)可以称为A00的分形矩阵,B00(00)、B00(01)、B00(10)和B00(11)可以称为B00的分形矩阵;相应的,矩阵C00可以由C00(00)、C00(01)、C00(10)和C00(11)组成。
在本申请实施例中,对于一个第一分形信号矩阵与对应的一个第一分形权重矩阵进行矩阵乘运算,得到一个矩阵乘运算结果,包括:分块第一分形信号矩阵,得到x行h列的多个第二分形信号矩阵,以及分块第一分形权重矩阵,得到h行y列的多个第二分形权重矩阵,每个第二分形信号矩阵和每个第二分形权重矩阵均满足非近似正方形;将多个第二分形信号矩阵和多个第二分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果。可选的,|x-h|≥2,|h-y|≥2。
结合图10,参见图11,当处理器还包括第一寄存器、第二寄存器和第三寄存器时,处理器可以通过第一寄存器与第一缓存器进行交互,通过第二寄存器与第二缓存器进行交互,以及通过第三寄存器与第三缓存器进行交互。其中,第一寄存器可以用于存储第二分形信号矩阵,即用于存储最小的分形信号矩阵,比如,用于存储图8中的A00(00)、A00(01)、A00(10)或者A00(11)。第二寄存器用于存储第二分形权重矩阵,即用于存储最小的分形权重矩阵,比如,用于存储图8中的B00(00)、B00(01)、B00(10)或者B00(11)。第三寄存器用于存储多个第二分形信号矩阵与多个第二分形权重矩阵的矩阵乘运算过程中的矩阵乘运算结果或者至少两个矩阵乘运算结果的累加,比如,用于存储图8中的A00(00)B00(00)或者A00(00)B00(01)等。
具体的,当MAC单元进行矩阵乘运算时,MAC单元读取的第一分形信号矩阵存储在第一缓存器中,读取的第一分形权重矩阵存储在第二缓存器中,MAC单元通过第一寄存器分别从第一缓存器中读取x行h列的第二分形信号矩阵,通过第二寄存器分别从第二缓存器中读取h行y列的第二分形权重矩阵。MAC单元将多个第二分形信号矩阵与多个第二分形权重矩阵的矩阵乘运算过程中的矩阵乘结果或者至少两个矩阵乘结果的累加通过第三寄存器存储在第三缓存器中,和/或,从第三缓存器中读取该矩阵乘结果或者至少两个矩阵乘结果的累加。
相应的,当根据多个第二分形信号矩阵和多个第二分形权重矩阵进行矩阵乘和累加运算时,x与第一缓存器的第二读写功耗正相关,y与第二缓存器的第二读写功耗正相关,h与第一缓存器和第二缓存器的第二读写功耗反相关。此外,x和y分别与第三缓存器的读写功耗反相关,h与第三缓存器的读写功耗正相关。其中,x、h和y与不同缓存器之间的读写功耗的关系分析与上述X、H和Y与不同缓存器之间的读写功耗的关系的分析类似,具体参见上述描述,本申请实施例在此不再赘述。
需要说明的是,这里的第一缓存器的第二读写功耗包括向第一缓存器写入第一分形信号矩阵的功耗,以及通过第一寄存器从第一缓存器中读取第二分形信号矩阵的功耗;第二缓存器的第二读写功耗包括向第二缓存器写入第一分形权重矩阵的功耗,以 及通过第二寄存器从第二缓存器中读取第二分形权重矩阵的功耗;第三缓存器的读写功耗包括通过第三寄存器向第三缓存器写入和读取矩阵乘运算结果或者至少两个矩阵乘运算结果的累加的功耗。
综上所述,以第一分形信号矩阵和第一分形权重矩阵为例,若第一缓存器和第二缓存器的容量相等(即X=Y,X×H=H×Y=常数)时,可以通过如下公式(6)表示MAC单元的总功耗与信号矩阵的行数和列数、权重矩阵的行数和列数、以及第一分形信号矩阵的行数X之间的关系。
E(X) = G_1(M,N,K)·X + G_2(M,N,K)/X + G_3(M,N,K)    (6)
其中,X是自变量,M和K分别为信号矩阵的行数和列数,K和N分别为权重矩阵的行数和列数,G_1、G_2和G_3是与M、N和K相关的子函数。
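To make formula (6) concrete, the sketch below minimizes E(X) over a small set of candidate block heights. G_1 and G_2 are placeholder constants (assumptions for illustration only), and the constant term G_3 is omitted since it does not affect the choice of X.

```python
import math

def best_X(G1, G2, candidates):
    """Pick the candidate X minimizing G1*X + G2/X (the X-dependent part of E(X))."""
    return min(candidates, key=lambda X: G1 * X + G2 / X)

G1, G2 = 2.0, 128.0                  # assumed device-dependent coefficients
continuous_opt = math.sqrt(G2 / G1)  # the two terms balance at X* = sqrt(G2/G1)
X = best_X(G1, G2, [2, 4, 8, 16, 32])
assert X == 8
```

In practice the candidates would be restricted to block sizes the buffers can hold (V1 = X·H fixed), so the search is over a handful of feasible shapes rather than a continuum.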
进而,在对信号矩阵和权重矩阵进行分块时,可以根据功耗最低原则确定对应的X,相应的也即是确定Y和H,进而获取多个第一分形信号矩阵和多个第一分形权重矩阵,进行矩阵乘和累加运算时,可以实现处理器的功耗的最优设计。由于不同设备的功耗参数不同,如何针对X、Y和H进行最优功耗设计可以结合对缓存器的性能参数理解和实际测试进行,具体取决于实际应用场景和器件选型,本实施例对此不做过多展开。
在实际应用中,也可以根据不同缓存器的容量、处理器的功耗和不同缓存器的带宽等确定第一分形信号矩阵的行数和列数,以及第一分形权重矩阵的行数和列数,从而在根据多个第一分形信号矩阵和多个第一分形权重矩阵确定输出矩阵时,可以使不同缓存器的容量和带宽得到充分利用,同时尽可能的降低处理器的功耗。根据以上实施例的介绍,***功耗与矩阵行数、列数、各缓存器的读写次数、或各缓存器的性能等各类参数存在一定关系,为了针对功耗进行优化,需要对各个缓存器的读写配置参数进行灵活调整,以利于降低功耗。为了适应这种灵活配置,本实施例设计了相关方法和装置,通过利用满足非近似正方形的分形信号矩阵和分形权重矩阵来执行计算,而不再将分形信号矩阵和分形权重矩阵严格限制在正方形,提高了设计灵活性,适应对缓存器的不同读写要求。
上述主要从设备的角度对本申请实施例提供的信号处理方法进行了介绍。可以理解的是,该设备为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的网元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请实施例可以根据上述方法示例对信号处理装置进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
在采用对应各个功能划分各个功能模块的情况下,图12示出了上述实施例中所涉及的信号处理装置的一种可能的结构示意图,该信号处理装置包括:获取单元1201、处理单元1202和输出单元1203。其中,获取单元1201用于支持该信号处理装置执行图7中的步骤701;处理单元1202用于支持该信号处理装置执行图7中的步骤702和703,和/或用于本文所描述的技术的其他过程;输出单元1203用于支持该信号处理装置执行图7中的步骤704。
上面从模块化功能实体的角度对本申请实施例中的一种信号处理装置进行描述,下面从处理器硬件处理的角度对本申请实施例中的一种信号处理装置进行描述。
本申请实施例提供一种信号处理装置,该设备的结构可以如图2所示,该信号处理装置包括:存储器201、处理器202、通信接口203和总线204。其中,通信接口203可以包括输入接口2031和输出接口2032。
输入接口2031:该输入接口用于获取信号矩阵和/或权重矩阵,该输入接口可以通过选择器实现获取信号矩阵和获取权重矩阵的切换;在一些可行的实施例中,该输入接口可以以分时复用的方式获取上述的信号矩阵或权重矩阵;在一些可行的实施例中,该输入接口可以有两个,分别实现信号矩阵和权重矩阵的获取,例如可实现同时获取信号矩阵和权重矩阵。
处理器202:被配置为可处理上述信号处理方法的步骤702-步骤703部分的功能。在一些可行的实施例中,该处理器可以是单处理器结构、多处理器结构、单线程处理器以及多线程处理器等,在一些可行的实施例中,该处理器可以集成在专用集成电路中,也可以是独立于集成电路之外的处理器芯片。
输出接口2032:该输出接口用于输出上述信号处理方法中的信号处理结果,在一些可行的实施例中,该信号处理结果可以由处理器直接输出,也可以先被存储于存储器中,然后经存储器输出;在一些可行的实施例中,可以只有一个输出接口,也可以有多个输出接口。在一些可行的实施例中,该输出接口输出的信号处理结果可以送到存储器中存储,也可以送到下一个信号处理装置继续进行处理,或者送到显示设备进行显示、送到播放器终端进行播放等。
存储器201:该存储器中可存储上述的信号矩阵、信号处理结果、权重矩阵、以及配置处理器的相关指令等。在一些可行的实施例中,可以有一个存储器,也可以有多个存储器;该存储器可以是软盘,硬盘如内置硬盘和移动硬盘,磁盘,光盘,磁光盘如CD_ROM、DCD_ROM,非易失性存储设备如RAM、ROM、PROM、EPROM、EEPROM、闪存、或者技术领域内所公知的任意其他形式的存储介质。
本申请实施例提供的上述信号处理装置的各组成部分分别用于实现相对应的前述信号处理方法的各步骤的功能,由于在前述的信号处理方法实施例中,已经对各步骤进行了详细说明,在此不再赘述。
本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当其在一个设备(比如,该设备可以是单片机,芯片、计算机等)上运行时,使得该设备执行上述信号处理方法的步骤701-步骤704中的一个或多个步骤。上述信号处理装置的各组成模块如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在所述计算机可读取存储介质中。
基于这样的理解,本申请实施例还提供一种包含指令的计算机程序产品,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或其中的处理器执行本申请各个实施例所述方法的全部或部分步骤。
最后应说明的是:以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (17)

  1. 一种信号处理方法,其特征在于,应用于包含处理器的设备中,所述方法包括:
    获取信号矩阵和权重矩阵,所述信号矩阵为二维矩阵且包括多个计算机可处理的待处理信号,所述权重矩阵为二维矩阵且包括多个权重系数,所述信号矩阵的列数与所述权重矩阵的行数相等;
    分块所述信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块所述权重矩阵,得到H行Y列的多个第一分形权重矩阵,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形;
    将所述多个第一分形信号矩阵和所述多个第一分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果,所述多个矩阵运算结果用于形成信号处理结果,每个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘结果由一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算得到。
  2. 根据权利要求1所述的信号处理方法,其特征在于,所述满足所述非近似正方形包括:矩阵的行数与列数的差值的绝对值大于或等于2。
  3. 根据权利要求1或2所述的信号处理方法,其特征在于,所述处理器包括第一缓存器和第二缓存器,所述分块所述信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块所述权重矩阵,得到H行Y列的多个第一分形权重矩阵,包括:
    通过所述第一缓存器分别多次从所述信号矩阵中读取X行H列的多个第一分形信号矩阵;
    通过所述第二缓存器分别多次从所述权重矩阵中读取H行Y列的多个第一分形权重矩阵。
  4. 根据权利要求1-3任一项所述的信号处理方法,其特征在于,所述处理器还包括第三缓存器,所述方法还包括:
    向所述第三缓存器中写入所述矩阵乘结果或至少两个所述矩阵乘结果的累加。
  5. 根据权利要求1-4任一项所述的信号处理方法,其特征在于,对于一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算,得到一个矩阵乘结果,包括:
    分块所述第一分形信号矩阵,得到x行h列的多个第二分形信号矩阵,以及分块所述第一分形权重矩阵,得到h行y列的多个第二分形权重矩阵,每个第二分形信号矩阵和每个第二分形权重矩阵均满足所述非近似正方形;
    将所述多个第二分形信号矩阵和所述多个第二分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果。
  6. 一种信号处理装置,其特征在于,所述装置包括:
    获取单元,用于获取信号矩阵和权重矩阵,所述信号矩阵为二维矩阵且包括多个计算机可处理的待处理信号,所述权重矩阵为二维矩阵且包括多个权重系数,所述信号矩阵的列数与所述权重矩阵的行数相等;
    处理单元,用于分块所述信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块所述权重矩阵,得到H行Y列的多个第一分形权重矩阵,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形,以及将所述多个第一分形信号矩阵和所述多个第一分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果,所述多个矩阵运算结果用于形成信号处理结果,每个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘结果由一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算得到。
  7. 根据权利要求6所述的信号处理装置,其特征在于,所述满足所述非近似正方形包括:矩阵的行数与列数的差值的绝对值大于或等于2。
  8. 根据权利要求6或7所述的信号处理装置,其特征在于,所述处理单元包括第一缓存器和第二缓存器,所述处理单元,具体用于:
    通过所述第一缓存器分别多次从所述信号矩阵中读取X行H列的多个第一分形信号矩阵;
    通过所述第二缓存器分别多次从所述权重矩阵中读取H行Y列的多个第一分形权重矩阵。
  9. 根据权利要求6-8任一项所述的信号处理装置,其特征在于,所述处理单元还包括第三缓存器,所述处理单元,还用于:
    向所述第三缓存器中写入所述矩阵乘结果或至少两个所述矩阵乘结果的累加。
  10. 根据权利要求6-9任一项所述的信号处理装置,其特征在于,对于一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算,所述处理单元,还用于:
    分块所述第一分形信号矩阵,得到x行h列的多个第二分形信号矩阵,以及分块所述第一分形权重矩阵,得到h行y列的多个第二分形权重矩阵,每个第二分形信号矩阵和每个第二分形权重矩阵均满足所述非近似正方形;
    将所述多个第二分形信号矩阵和所述多个第二分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果。
  11. 一种信号处理装置,其特征在于,所述装置包括:
    输入接口,用于获取信号矩阵和权重矩阵,所述信号矩阵为二维矩阵且包括多个计算机可处理的待处理信号,所述权重矩阵为二维矩阵且包括多个权重系数,所述信号矩阵的列数与所述权重矩阵的行数相等;
    处理器,被配置为可处理如下操作:
    分块所述信号矩阵,得到X行H列的多个第一分形信号矩阵,以及分块所述权重矩阵,得到H行Y列的多个第一分形权重矩阵,每个第一分形信号矩阵和每个第一分形权重矩阵均满足非近似正方形;
    将所述多个第一分形信号矩阵和所述多个第一分形权重矩阵进行矩阵乘和累加运算得到多个矩阵运算结果,所述多个矩阵运算结果用于形成信号处理结果,每个矩阵运算结果包括多个矩阵乘结果的累加,每个矩阵乘结果由一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算得到。
  12. 根据权利要求11所述的信号处理装置,其特征在于,所述满足所述非近似正方形包括:矩阵的行数与列数的差值的绝对值大于或等于2。
  13. 根据权利要求11或12所述的信号处理装置,其特征在于,所述处理器包括第一缓存器和第二缓存器,所述处理器还执行以下操作:
    通过所述第一缓存器分别多次从所述信号矩阵中读取X行H列的多个第一分形信号矩阵;
    通过所述第二缓存器分别多次从所述权重矩阵中读取H行Y列的多个第一分形权重矩阵。
  14. 根据权利要求11-13任一项所述的信号处理装置,其特征在于,所述处理器还包括第三缓存器,所述处理器还执行以下操作:
    向所述第三缓存器中写入所述矩阵乘结果或至少两个所述矩阵乘结果的累加。
  15. 根据权利要求11-14任一项所述的信号处理装置,其特征在于,对于一个第一分形信号矩阵与一个第一分形权重矩阵进行矩阵乘运算,所述处理器还执行以下操作:
    分块所述第一分形信号矩阵,得到x行h列的多个第二分形信号矩阵,以及分块所述第一分形权重矩阵,得到h行y列的多个第二分形权重矩阵,每个第二分形信号矩阵和每个第二分形权重矩阵均满足所述非近似正方形;
    将所述多个第二分形信号矩阵和所述多个第二分形权重矩阵进行矩阵乘和累加运算,得到多个矩阵运算结果。
  16. 一种可读存储介质,其特征在于,所述可读存储介质中存储有指令,当所述可读存储介质在设备上运行时,使得所述设备执行权利要求1-5任一项所述的信号处理方法。
  17. 一种计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行权利要求1-5任一项所述的信号处理方法。
PCT/CN2018/099733 2017-12-29 2018-08-09 一种信号处理方法及装置 WO2019128248A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18896420.9A EP3663938B1 (en) 2017-12-29 2018-08-09 Signal processing method and apparatus
US16/819,976 US20200218777A1 (en) 2017-12-29 2020-03-16 Signal Processing Method and Apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711481199.4 2017-12-29
CN201711481199.4A CN109992742A (zh) 2017-12-29 2017-12-29 一种信号处理方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/819,976 Continuation US20200218777A1 (en) 2017-12-29 2020-03-16 Signal Processing Method and Apparatus

Publications (1)

Publication Number Publication Date
WO2019128248A1 true WO2019128248A1 (zh) 2019-07-04

Family

ID=67065022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/099733 WO2019128248A1 (zh) 2017-12-29 2018-08-09 一种信号处理方法及装置

Country Status (4)

Country Link
US (1) US20200218777A1 (zh)
EP (1) EP3663938B1 (zh)
CN (1) CN109992742A (zh)
WO (1) WO2019128248A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462514B (zh) * 2020-03-23 2023-11-03 腾讯科技(深圳)有限公司 自动驾驶控制方法和相关装置
CN112712173B (zh) * 2020-12-31 2024-06-07 北京清微智能科技有限公司 基于mac乘加阵列的稀疏化运算数据的获取方法及***
CN112766467B (zh) * 2021-04-06 2021-08-20 深圳市一心视觉科技有限公司 基于卷积神经网络模型的图像识别方法
CN117370717B (zh) * 2023-12-06 2024-03-26 珠海錾芯半导体有限公司 一种二分坐标下降的迭代优化方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104170274A (zh) * 2014-03-10 2014-11-26 华为技术有限公司 处理信号的装置和方法
CN106127297A (zh) * 2016-06-02 2016-11-16 中国科学院自动化研究所 基于张量分解的深度卷积神经网络的加速与压缩方法
WO2017051358A1 (en) * 2015-09-25 2017-03-30 Sisvel Technology Srl Methods and apparatuses for encoding and decoding digital images through superpixels
CN107239824A (zh) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 用于实现稀疏卷积神经网络加速器的装置和方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8051124B2 (en) * 2007-07-19 2011-11-01 Itt Manufacturing Enterprises, Inc. High speed and efficient matrix multiplication hardware module
US10223333B2 (en) * 2014-08-29 2019-03-05 Nvidia Corporation Performing multi-convolution operations in a parallel processing system
CN104915322B (zh) * 2015-06-09 2018-05-01 中国人民解放军国防科学技术大学 一种卷积神经网络硬件加速方法
US10275394B2 (en) * 2015-10-08 2019-04-30 Via Alliance Semiconductor Co., Ltd. Processor with architectural neural network execution unit
CN105426344A (zh) * 2015-11-09 2016-03-23 南京大学 基于Spark的分布式大规模矩阵乘法的矩阵计算方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104170274A (zh) * 2014-03-10 2014-11-26 华为技术有限公司 处理信号的装置和方法
WO2017051358A1 (en) * 2015-09-25 2017-03-30 Sisvel Technology Srl Methods and apparatuses for encoding and decoding digital images through superpixels
CN106127297A (zh) * 2016-06-02 2016-11-16 中国科学院自动化研究所 基于张量分解的深度卷积神经网络的加速与压缩方法
CN107239824A (zh) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 用于实现稀疏卷积神经网络加速器的装置和方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3663938A4

Also Published As

Publication number Publication date
EP3663938A1 (en) 2020-06-10
US20200218777A1 (en) 2020-07-09
CN109992742A (zh) 2019-07-09
EP3663938B1 (en) 2022-10-26
EP3663938A4 (en) 2021-01-27

Similar Documents

Publication Publication Date Title
JP6857286B2 (ja) ニューラルネットワークアレイの性能の改善
WO2019128248A1 (zh) 一种信号处理方法及装置
US10936937B2 (en) Convolution operation device and convolution operation method
WO2020073211A1 (zh) 运算加速器、处理方法及相关设备
US9411726B2 (en) Low power computation architecture
CN108629406B (zh) 用于卷积神经网络的运算装置
WO2019238029A1 (zh) 卷积神经网络***和卷积神经网络量化的方法
CN111340077B (zh) 基于注意力机制的视差图获取方法和装置
CN109993275B (zh) 一种信号处理方法及装置
CN112789627B (zh) 一种神经网络处理器、数据处理方法及相关设备
WO2019001323A1 (zh) 信号处理的***和方法
US20210042616A1 (en) Arithmetic processing device
US11775807B2 (en) Artificial neural network and method of controlling fixed point in the same
EP4379607A1 (en) Neural network accelerator, and data processing method for neural network accelerator
CN111353598A (zh) 一种神经网络压缩方法、电子设备及计算机可读介质
WO2019095333A1 (zh) 一种数据处理方法及设备
CN113630375A (zh) 使用四叉树方法的参数的压缩设备及方法
WO2023109748A1 (zh) 一种神经网络的调整方法及相应装置
WO2021081854A1 (zh) 一种卷积运算电路和卷积运算方法
WO2023122896A1 (zh) 一种数据处理方法和装置
CN109146069B (zh) 运算装置、运算方法和芯片
WO2021179117A1 (zh) 神经网络通道数搜索方法和装置
WO2021120036A1 (zh) 数据处理装置和数据处理方法
CN111382835A (zh) 一种神经网络压缩方法、电子设备及计算机可读介质
CN113095211B (zh) 一种图像处理方法、***及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18896420

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018896420

Country of ref document: EP

Effective date: 20200305

NENP Non-entry into the national phase

Ref country code: DE