WO2021238289A1 - Sequence processing method and apparatus (序列处理的方法与装置) - Google Patents

Sequence processing method and apparatus

Info

Publication number
WO2021238289A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
self-attention
window
elements
Prior art date
Application number
PCT/CN2021/073868
Other languages
English (en)
French (fr)
Inventor
黄文勇
杨宇庭
陈晓
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP21813598.6A priority Critical patent/EP4152203A4/en
Publication of WO2021238289A1 publication Critical patent/WO2021238289A1/zh
Priority to US17/994,068 priority patent/US20230088915A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/045 - Combinations of networks
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/048 - Activation functions
    • G06N3/08 - Learning methods

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method and device for sequence processing.
  • NLP: natural language processing
  • a sequence consists of several elements arranged in order.
  • speech data can be expressed as a sequence with sampling points as elements.
  • text data can be expressed as a sequence with words as elements.
  • the meaning of a certain element in the sequence is often related to other elements in the sequence.
  • How to model the relationship between the elements in the sequence is the key to the sequence processing problem.
  • methods for modeling the relationship between elements in a sequence include recurrent neural networks (RNN), convolutional neural networks (CNN), and self-attention.
  • RNN: recurrent neural network
  • CNN: convolutional neural network
  • self-attention is a way to obtain the representation of an element in a sequence by establishing relationships between that element and the other elements in the sequence.
  • the traditional self-attention method establishes, for each element, relationships between that element and all elements in the sequence, which leads to a large amount of self-attention calculation.
  • the solution proposed in the current technology is to perform, for each element, the self-attention calculation using a fixed set of several elements near that element.
  • however, this scheme causes the problem that the dependency range of self-attention is limited.
  • the present application provides a method and device for sequence processing, which can better balance the calculation amount and dependence range of self-attention.
  • In a first aspect, a method for sequence processing is provided, which includes: receiving an input sequence, where the input sequence includes a plurality of ordered elements; for a first element in the input sequence, performing a self-attention calculation using the elements contained in M windows to obtain a representation of the first element, where each of the M windows contains one element or multiple consecutive elements of the input sequence, at least one element is spaced between different windows, at least one of the M windows does not contain the first element, and M is an integer greater than or equal to 1; and obtaining an output sequence corresponding to the input sequence based on the representation of the first element.
  • the first window contains one element or multiple consecutive elements of the input sequence other than the first element; in other words, the first window skips the first element and contains one or more other consecutive elements of the input sequence.
  • the adjacent elements of the first element are also not included in the first window.
  • the position of the first window can be flexibly configured, not fixed. As long as the first element (or its neighboring elements) is skipped, the first window can be located anywhere on the input sequence.
  • the size of the first window is also configurable and not fixed.
  • the elements in the first window can be used instead of all elements in the sequence to perform self-attention calculations, which can reduce the amount of self-attention calculations.
  • the self-attention calculation is performed on the first element in the sequence based on the elements in the first window; because the first window can skip the first element and its neighboring elements, and the position of the first window is not fixed, the limitation on the dependency range of self-attention can be reduced compared with the prior art.
  • the embodiments of the present application can better balance the calculation amount and dependence range of self-attention.
  • the method further includes: determining the M windows according to the position of the first element in the input sequence, where the M windows include a first window, the first window includes elements of the input sequence whose dependency length from the first element is greater than or equal to a and less than b, a is an integer greater than 1, b is an integer greater than a, and the dependency length represents the distance between the first element and the elements in the M windows.
  • the values of a and b can be flexibly configured according to application requirements to reasonably determine the position of the first window, thereby selecting a reasonable range of self-attention dependence.
  • the dependency range of self-attention refers to, for a given element, the range of dependency lengths between that element and the other elements with which relationships are established (that is, with which self-attention calculations are performed).
  • the dependency length represents the distance between the element and other elements.
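  • For illustration only, the following minimal Python sketch shows how a window defined by a dependency-length range [a, b) could be mapped to element indices; the function name, the 0-based indexing, and the left/right parameter are assumptions, not part of the patent text.

```python
def window_indices(i, a, b, seq_len, side="right"):
    """Indices of elements whose dependency length from element i lies in [a, b).

    The dependency length is the distance |j - i| between element j and element i.
    `side` selects whether the window lies after ("right") or before ("left") element i.
    """
    if side == "right":
        lo, hi = i + a, min(i + b, seq_len)       # elements i+a .. i+b-1
    else:
        lo, hi = max(i - b + 1, 0), i - a + 1     # elements i-b+1 .. i-a
    return list(range(lo, hi))

# Example (0-based indices): for element index 6 in a length-15 sequence, with a=3, b=7,
# the right-side window covers indices 9..12 (dependency lengths 3..6) and the
# left-side window covers indices 0..3 (dependency lengths 6 down to 3).
print(window_indices(6, 3, 7, 15))           # [9, 10, 11, 12]
print(window_indices(6, 3, 7, 15, "left"))   # [0, 1, 2, 3]
```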
  • the method is applied to multiple self-attention layers, and the input sequence is the sequence output by the self-attention layer at the previous level of the current self-attention layer; the values of a and b are set such that the self-attention calculation performed on the first element by the current self-attention layer does not repeat the self-attention calculation performed on the first element by the previous-level self-attention layer.
  • the previous-level self-attention layer performs the self-attention calculation on the first element based on the elements contained in a fifth window
  • the fifth window contains elements of the sequence whose dependency length from the first element is greater than or equal to a1 and less than b1
  • b1 is a positive integer
  • a1 is a non-negative integer less than b1
  • the value of a is greater than the value of b1.
  • in this way, the self-attention dependency range of the first element can be selected flexibly, which further reduces the limitation on the dependency range of self-attention.
  • the position of the first window may be preset.
  • M is greater than 1, and the value of M is preset.
  • M is greater than 1, which means that you can use elements in multiple windows to perform self-attention calculations on the first element.
  • the value of M is preset, which means that the value of M has nothing to do with the length of the input sequence. In other words, the value of M may not increase as the length of the input sequence increases.
  • the self-attention calculation is performed on the elements in the sequence by using more than one window, which can ensure the dependence range of self-attention. It can be understood that for an element, the more windows for self-attention calculation, the larger the self-attention dependence range of the element. In the embodiment of the present application, the number of windows can be set reasonably to ensure the dependence range of the self-attention.
  • the number M of windows used to perform the self-attention calculation on an element is independent of the length of the input sequence; therefore, the problem in the prior art that the calculation overhead grows with the square of the input sequence length can be avoided, and the amount of self-attention calculation can be reduced relative to the prior art.
  • one or more elements are spaced between different windows in the M windows, which can also reduce the amount of calculation for self-attention.
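  • Purely as an illustrative back-of-the-envelope comparison (the numbers below are hypothetical, not taken from the patent): for a sequence of length L with element vectors of width d, full self-attention costs on the order of L·d per element and L²·d for the whole sequence, whereas attending to M windows of w elements each costs on the order of M·w·d per element and L·M·w·d for the sequence.

```python
# Hypothetical numbers for illustration only.
L, d = 1000, 64          # sequence length and vector width
M, w = 2, 4              # number of windows per element and elements per window

full_per_element = L * d             # attend to all L elements
full_sequence = L * L * d            # grows with the square of L
windowed_per_element = M * w * d     # attend only to M windows of w elements
windowed_sequence = L * M * w * d    # grows linearly with L

print(full_sequence, windowed_sequence)   # 64000000 vs 512000
```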
  • the M windows include a second window and/or a third window.
  • a second window: the second window includes elements of the input sequence that are located in front of the first element and whose dependency length from the first element is greater than or equal to al and less than bl, where bl is a positive integer and al is a non-negative integer less than bl.
  • a third window: the third window includes elements of the input sequence that are located behind the first element and whose dependency length from the first element is greater than or equal to ar and less than br, where br is a positive integer and ar is a non-negative integer less than br.
  • al and ar may be equal or unequal
  • bl and br may be equal or unequal
  • the M windows include a fourth window, and the fourth window includes the first element and its neighboring elements.
  • the input sequence is a speech sequence or a text sequence.
  • a device for sequence processing includes a receiving unit, a processing unit, and an output unit.
  • the receiving unit is configured to receive an input sequence, and the input sequence includes a plurality of elements in a sequence.
  • the processing unit is configured to perform a self-attention calculation on the first element in the input sequence using the elements contained in M windows to obtain a representation of the first element, where each of the M windows contains one element or multiple consecutive elements of the input sequence, at least one element is spaced between different windows, at least one of the M windows does not contain the first element, and M is an integer greater than or equal to 1.
  • the output unit is configured to obtain an output sequence corresponding to the input sequence based on the representation of the first element.
  • the processing unit is further configured to determine the M windows according to the position of the first element in the input sequence, and
  • the M windows include a first window, and the first window includes elements of the input sequence whose dependency length from the first element is greater than or equal to a and less than b, where a is an integer greater than 1, b is an integer greater than a, and the dependency length represents the distance between the first element and the elements in the M windows.
  • the device is applied to multiple self-attention layers, and the input sequence is the sequence output by the previous self-attention layer of the current self-attention layer; the values of a and b are set such that the self-attention calculation performed on the first element by the current self-attention layer does not repeat the self-attention calculation performed on the first element by the previous self-attention layer.
  • the previous self-attention layer performs the self-attention calculation on the first element based on the elements contained in the fifth window, and the fifth window includes elements of the sequence whose dependency length from the first element is greater than or equal to a1 and less than b1, where b1 is a positive integer and a1 is a non-negative integer less than b1; the value of a is greater than the value of b1.
  • M is greater than 1, and the value of M is preset.
  • the M windows include a second window and/or a third window.
  • for the second window and the third window, refer to the foregoing description; details are not repeated here.
  • the M windows include a fourth window, and the fourth window includes the first element and its neighboring elements.
  • the input sequence is a speech sequence or a text sequence.
  • a neural network processing device which includes an input module, a processing module, an output module, and the sequence processing device according to any one of claims 9-16.
  • the input module is used to input the input sequence into the sequence processing device; the sequence processing device is used to perform self-attention calculation on the input sequence to obtain the output sequence corresponding to the input sequence;
  • the processing module is used to process the output sequence to obtain a sequence processing result; the output module is used to output an output signal based on the sequence processing result obtained by the processing module.
  • when the input sequence is a voice sequence, the processing module is configured to perform voice recognition processing on the output sequence to obtain a voice recognition result; or, when the input sequence is a text sequence, the processing module is configured to perform semantic understanding processing on the output sequence to obtain a semantic understanding result.
  • In a fourth aspect, a data processing device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to execute the method in the first aspect.
  • a computer-readable medium is provided, which stores program code for execution by a device, where the program code includes instructions for executing the method in the first aspect.
  • a computer program product containing instructions is provided; when the computer program product runs on a computer, the computer is caused to execute the method in the first aspect.
  • In a seventh aspect, a chip is provided, which includes a processor and a data interface; the processor reads, through the data interface, instructions stored in a memory to execute the method in the first aspect.
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
  • the processor is used to execute the method in the first aspect described above.
  • the self-attention calculation is performed on the first element in the sequence based on the elements in the first window; because the first window can skip the first element and its neighboring elements, and the position of the first window may not be fixed, the limitation on the dependency range of self-attention can be reduced compared with the prior art.
  • this application can use elements in multiple windows to perform self-attention calculations.
  • the number of the multiple windows is independent of the length of the sequence, and there are gaps between different windows. While reducing the calculation amount of self-attention, it is possible to give consideration to the dependence range of self-attention as much as possible, so as to achieve a balance between the calculation amount of self-attention and the dependence range.
  • Figure 1 is a schematic diagram of the self-attention mechanism.
  • Figure 2 is a schematic diagram of the architecture of a neural network including a self-attention layer.
  • Figure 3 is a schematic diagram of the local self-attention mechanism.
  • FIG. 4 is a schematic flowchart of a sequence processing method provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of a window used to perform self-attention calculation on the first element in a sequence in an embodiment of the application.
  • FIG. 6 is another schematic flowchart of a sequence processing method provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of a window for performing self-attention calculation on the first element in the sequence when the embodiment of the application is applied to multiple self-attention layer scenes.
  • FIG. 8 to FIG. 12 are schematic diagrams of M windows used to perform self-attention calculation on the first element in a sequence in an embodiment of the application.
  • FIG. 13 is a schematic diagram of performing self-attention calculation on elements in a sequence when an embodiment of the application is applied to a scene with multiple self-attention layers.
  • Fig. 14 is a schematic diagram of adopting a local self-attention mechanism in a scene with multiple self-attention layers.
  • FIG. 15 is a schematic block diagram of a sequence processing apparatus provided by an embodiment of the application.
  • FIG. 16 is another schematic block diagram of a sequence processing apparatus provided by an embodiment of this application.
  • FIG. 17 is a schematic block diagram of a neural network processing device provided by an embodiment of the application.
  • FIG. 18 is a schematic block diagram of a speech recognition system provided by an embodiment of this application.
  • FIG. 19 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • sequence data: data in the form of sequences
  • the problem of processing sequence data is referred to as a sequence processing problem for short.
  • an input sentence can be expressed as a word sequence.
  • Word sequences can also be called text sequences.
  • a continuous speech is divided into frames of equal time, which can be expressed as a sequence of frames.
  • the frame sequence can also be referred to as a speech sequence.
  • a sequence is composed of several elements, and there is an order among the elements.
  • speech data can be expressed as a sequence with sampling points as elements.
  • text data can be expressed as a sequence with words as elements.
  • "he", "at", "26", "years old", "founded", "famous", "de", "special" and "relativity" are, respectively, the elements of the text sequence "He founded the famous special theory of relativity at the age of 26".
  • methods for modeling the relationship between elements in a sequence include recurrent neural networks (RNN), convolutional neural networks (CNN), and self-attention.
  • RNN: recurrent neural network
  • CNN: convolutional neural network
  • self-attention is a way to obtain the representation of this element by establishing the relationship between a certain element in the sequence and other elements in the sequence.
  • self-attention is a method used to model the relationship between elements in a sequence to get a better element representation.
  • the representation obtained after self-attention calculation can be called a new representation of the element.
  • Self-attention can be used as a layer in a neural network.
  • a neural network that includes a self-attention layer can also be called an input sequence processor.
  • Fig. 2 shows a schematic block diagram of the input sequence processor.
  • the input sequence processor is a neural network including a self-attention layer, and the neural network may also include other neural network layers.
  • the sequence to be processed is input to the sequence processor, and the self-attention layer performs self-attention operations on the sequence to obtain a new representation of each element in the sequence, thereby obtaining a new sequence; the new sequence is input to other neural network layers for processing, and finally the sequence processing result is obtained, that is, the sequence processor outputs the sequence processing result.
  • when the sequence to be processed is a text sequence, the sequence processing result output by the sequence processor may be a text processing result such as a semantic understanding result or a machine translation result.
  • when the sequence to be processed is a speech sequence, the sequence processing result output by the sequence processor may be a speech processing result such as a speech recognition result.
  • the sequence to be processed can be input to the sequence processor after being processed by the feature extraction module.
  • the sequence processor may include one or more self-attention layers, and in a scene including multiple self-attention layers, other neural network layers may be included between the two self-attention layers.
  • the architecture design of the neural network including the self-attention layer is an existing technology, and will not be described in detail in this article.
  • assume the sequence is H = {h_1, h_2, ..., h_i, ..., h_L}, where L is the length of the sequence.
  • h_i represents an element of the sequence H.
  • each element h_i is represented as a vector of width d.
  • self-attention models the relationship between h_i and the other elements of the sequence and uses these relationships to obtain a new representation h′_i of the element h_i; the process may be expressed as: h′_i = softmax(Q(h_i)·K(H)ᵀ / √d)·V(H).
  • Q(), K() and V() are usually each a linear map.
  • d represents the width of the vector used to represent the elements; that is, each element in the sequence is represented by a vector of width d.
  • softmax() represents the normalized exponential function.
  • when the self-attention calculation for a sequence is performed in the manner shown in the above formula, the calculation amount of the self-attention of a single element is O(Ld), and the calculation amount for the entire sequence is O(L²d); it can be seen that the calculation cost grows with the square of the length of the input sequence, so when long sequences are processed there is often a problem of excessive calculation.
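  • For illustration only, a minimal NumPy sketch of the per-element self-attention computation matching the formula above; modeling Q(), K() and V() as plain weight matrices is an assumption, not part of the patent text.

```python
import numpy as np

def full_self_attention(H, i, Wq, Wk, Wv):
    """New representation h'_i of element i, attending to all L elements (cost O(L*d))."""
    d = H.shape[1]
    q = H[i] @ Wq                       # query for element i, shape (d,)
    K = H @ Wk                          # keys for all L elements, shape (L, d)
    V = H @ Wv                          # values for all L elements, shape (L, d)
    scores = K @ q / np.sqrt(d)         # one score per element, shape (L,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over all L elements
    return weights @ V                  # weighted sum of value vectors, shape (d,)

L, d = 15, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(L, d))                              # the sequence h_1 .. h_L
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
h_new = full_self_attention(H, 6, Wq, Wk, Wv)            # element 7 (0-based index 6)
```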
  • the dependency range of self-attention refers to, for a given element, the range of dependency lengths between that element and the other elements with which relationships are established (that is, with which self-attention calculations are performed).
  • the dependency length represents the distance between the element and the other elements. For example, in the example shown in FIG. 3, for the element "founded", assuming the dependency length between "founded" and itself is recorded as 0, the dependency length between the element "founded" and the element "years old" is 1 (similarly, the dependency length with the element "famous" is also 1), and the dependency length between the element "founded" and the element "26" is 2 (similarly, the dependency length with the element "de" is also 2). That is, in the example of FIG. 3, the dependency range for the self-attention calculation of the element "founded" is 0-2.
  • this application proposes a method and device for sequence processing, which can better achieve a balance between the amount of self-attention calculation and the range of dependence.
  • FIG. 4 is a schematic flowchart of a sequence processing method 400 provided by an embodiment of the application.
  • the method 400 includes step S410, step S420, and step S430.
  • S410 Receive an input sequence, where the input sequence includes multiple elements in a sequence.
  • the input sequence represents the sequence to be processed for self-attention.
  • the method 400 is executed using the self-attention layer shown in FIG. 2, and the input sequence may be a sequence output by the previous neural network layer of the self-attention layer.
  • the input sequence may be a speech sequence.
  • a continuous speech is divided into frames of equal time, and the resulting frame sequence can be called a speech sequence.
  • a speech sequence is a sequence whose elements are sampling points.
  • the input sequence may be a text sequence.
  • an input sentence can be expressed as a word sequence.
  • Word sequences can also be called text sequences.
  • the text sequence is a sequence in which the elements are words.
  • S420 Perform a self-attention calculation on the first element in the input sequence using the elements contained in M windows to obtain a representation of the first element, where each of the M windows contains one element or multiple consecutive elements of the input sequence, at least one element is spaced between different windows, at least one of the M windows does not contain the first element, and M is an integer greater than or equal to 1.
  • the first element represents any element in the input sequence.
  • the self-attention processing of a sequence includes the self-attention calculation of each element in the sequence.
  • the first element is used as an example for description in the embodiments of the present application.
  • the first element represents any element in the input sequence. In other words, for any element in the input sequence, a self-attention calculation is performed on the element in the manner of step S420 to obtain a representation of the element.
  • when M is equal to 1, the elements in a single window (denoted as the first window) are used to perform the self-attention calculation on the first element; the first window contains one element or multiple consecutive elements of the input sequence other than the first element; in other words, the first window skips the first element and contains one or more other consecutive elements of the input sequence.
  • the adjacent elements of the first element are also not included in the first window.
  • the adjacent elements of the first element include elements adjacent to the first element.
  • the first element is element 7, and the adjacent elements of the first element include element 6, which is adjacent in front of it, and element 8, which is adjacent behind it.
  • when M is greater than 1, elements in multiple windows are used to perform the self-attention calculation on the first element; the case where M is greater than 1 is described below.
  • the following first describes the self-attention calculation of the first element using the elements in the first window in step S420 as an example.
  • the position of the first window can be flexibly configured, not fixed. As long as the first element (or its neighboring elements) is skipped, the first window can be located anywhere on the input sequence.
  • the first window is located in front of the first element.
  • the first window is located behind the first element.
  • when the first element is the first element in the input sequence, the first window is located behind the first element; when the first element is the last element in the input sequence, the first window is located in front of the first element; when the first element is a middle element in the input sequence, the first window can be located in front of or behind the first element.
  • the position of the first window can be reasonably determined according to application requirements.
  • the size of the first window is also configurable and not fixed.
  • the first window contains 1, 2, 3 or more elements.
  • the size of the first window can be reasonably configured according to application requirements.
  • the input sequence is composed of element 1 to element 15.
  • the first element is element 7, and the first window can be any one of window 1, window 2, and window 3 shown in FIG. 5.
  • the following formula can be used to perform the self-attention calculation on the first element h_i to obtain a new representation h′_i of the first element: h′_i = Attend(h_i, S).
  • S represents a set of elements used to perform self-attention calculations on the first element, and S includes elements in the first window.
  • Attend() represents the calculation method of self-attention.
  • the calculation method of self-attention is the prior art, which will not be described in detail in this article.
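  • As a hedged sketch (the function shape and the dense-matrix form of the linear maps are assumptions, not the patent's implementation), h′_i = Attend(h_i, S) restricted to the elements of the first window could look as follows; the per-element cost drops from O(L·d) to O(|S|·d).

```python
import numpy as np

def attend(H, i, S_indices, Wq, Wk, Wv):
    """h'_i = Attend(h_i, S): attend only to the elements of S given by S_indices."""
    d = H.shape[1]
    S = H[S_indices]                    # elements in the first window, shape (|S|, d)
    q = H[i] @ Wq
    K, V = S @ Wk, S @ Wv
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over the window elements only
    return weights @ V

# Element 7 (0-based index 6) attends to a window that skips it and its neighbours,
# here the three elements at dependency lengths 3-5 (0-based indices 9-11).
rng = np.random.default_rng(0)
H = rng.normal(size=(15, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
h_new = attend(H, 6, [9, 10, 11], Wq, Wk, Wv)
```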
  • S430 Obtain an output sequence corresponding to the input sequence based on the representation of the first element.
  • in step S430, the output sequence is obtained based on the representation of each element in the input sequence, where, for each element in the input sequence, the representation of the corresponding element is obtained in step S420.
  • the method 400 is executed using the self-attention layer shown in FIG. 2, and the output sequence obtained in step S430 can be passed to the next neural network layer of the self-attention layer for subsequent processing.
  • the self-attention calculation is performed based on the elements in the first window instead of all elements in the sequence, which can reduce the amount of self-attention calculations.
  • the self-attention calculation is performed on the first element in the sequence based on the elements in the first window; because the first window can skip the first element and its neighboring elements, and the position of the first window may not be fixed, the limitation on the dependency range of self-attention can be reduced compared with the prior art.
  • the embodiments of the present application can better balance the calculation amount and dependence range of self-attention.
  • the window used for self-attention calculation is not fixed and can be dynamically changed. Therefore, the self-attention mechanism provided by the embodiments of the present application can be called jumping self-attention.
  • the position of the first window on the input sequence can be determined in a variety of ways.
  • the position of the first window is determined according to the position of the first element.
  • for example, the first window is set to include elements of the input sequence whose dependency length from the first element is greater than or equal to a and less than b, where the dependency length represents the distance between the first element and an element in the first window, a is an integer greater than 1, and b is an integer greater than a; it should be understood that the value of b is less than the length of the input sequence.
  • the method 400 may further include step S440.
  • S440 Determine a first window according to the position of the first element in the input sequence.
  • the first window includes elements of the input sequence whose dependency length from the first element is greater than or equal to a and less than b,
  • where a is an integer greater than 1 and b is an integer greater than a.
  • in step S420, the elements in the first window are used to perform the self-attention calculation on the first element to obtain the representation of the first element.
  • the following formula can be used to perform the self-attention calculation on the first element h_i to obtain the new representation h′_i of the first element: h′_i = Attend(h_i, S), where S is the set of elements in the first window and Attend() represents the self-attention calculation.
  • for example, the first window may be window 1. If the first window is set to include elements of the input sequence whose dependency length from the first element is greater than 1 and less than 5, then the first window is still window 1. If the first window is set to include elements whose dependency length from the first element is greater than 2 and less than 6 (or 7, or 8), then the first window is window 2. If the first window is set to include elements whose dependency length from the first element is greater than 6 and less than 9, then the first window is window 3.
  • in this way, the self-attention dependency range of the first element can be selected flexibly, which further reduces the limitation on the dependency range of self-attention.
  • the position of the first window is preset.
  • the position of the first window has nothing to do with the position of the first element.
  • a neural network that includes a self-attention layer usually includes multiple self-attention layers; the "×N" shown in FIG. 2 means that the neural network can include N of the layer combination shown by the dashed line in FIG. 2,
  • where the layer combination includes the self-attention layer, so that the neural network includes multiple self-attention layers.
  • the sequence processing method provided in the embodiments of the present application can be applied not only to a single self-attention layer, but also to multiple self-attention layers; by reasonably setting the positions of the windows on two adjacent layers, the amount of self-attention calculation can be further reduced, as described below.
  • the method 400 is applied to multiple self-attention layers
  • the input sequence is the sequence output by the previous self-attention layer of the current self-attention layer
  • the position of the first window is Determined according to the position of the first element
  • the first window contains elements of the input sequence whose dependency length from the first element is greater than or equal to a and less than b, where the values of a and b are set such that
  • the self-attention calculation performed on the first element by the current self-attention layer does not repeat the self-attention calculation performed on the first element by the previous-level self-attention layer.
  • the current self-attention layer is recorded as the self-attention layer X
  • the self-attention layer before the self-attention layer X is recorded as the self-attention layer (X-1).
  • assume the self-attention layer (X-1) has already established the relationship between the first element and element 1 when performing the self-attention calculation on the first element
  • then, by setting the values of a and b, element 1 can be skipped and other elements can be used to perform the self-attention calculation on the first element at the self-attention layer X.
  • the self-attention layer (X-1) performs the self-attention calculation on the first element based on the elements contained in a fifth window
  • the fifth window contains elements of the sequence whose dependency length from the first element is greater than or equal to a1 and less than b1
  • b1 is a positive integer
  • a1 is a non-negative integer less than b1
  • the value of a is greater than the value of b1.
  • the input sequence consists of element 1 to element 15, and the first element is element 7.
  • the self-attention layer (X-1) is the previous self-attention layer of the self-attention layer X, and the input sequence of the self-attention layer X is obtained based on the output sequence of the self-attention layer (X-1).
  • the self-attention layer (X-1) uses the elements in window 1 to perform the self-attention calculation on the first element; the self-attention layer X can therefore skip elements 6, 7 and 8 when performing the self-attention calculation on the first element, and can, for example, use the elements in window 2, window 3, or window 4 to perform the calculation, which avoids double calculation.
  • FIG. 7 is only an example and not a limitation.
  • the windows on two adjacent self-attention layers can be coordinately set up and down according to specific needs to reduce the amount of self-attention calculations.
  • for example, the input sequence of the self-attention layer X is directly the output sequence of the self-attention layer (X-1).
  • alternatively, the input sequence of the self-attention layer X is the output sequence obtained after the output sequence of the self-attention layer (X-1) is processed by other neural network layers.
  • for example, each of the self-attention layers performs the self-attention calculation on the first element using the elements in a window that contains elements of the sequence whose dependency length from the first element is greater than a and less than b.
  • the self-attention layer 1 is the previous level of the self-attention layer 2
  • the self-attention layer 2 is the previous level of the self-attention layer 3.
  • the definitions of a and b on the three self-attention layers are shown in Table 1.
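  • Since the concrete values of Table 1 are not reproduced above, the following sketch uses purely hypothetical per-layer (a, b) values to illustrate the stated rule that choosing a for the current layer greater than b of the previous layer avoids re-attending to the same elements; all names and numbers here are illustrative assumptions.

```python
# Hypothetical dependency-length ranges [a, b) per layer; the values are illustrative only.
layer_ranges = [
    ("self-attention layer 1", 0, 3),   # attends to dependency lengths 0..2
    ("self-attention layer 2", 4, 7),   # a=4 > previous b=3, so lengths 0..2 are skipped
    ("self-attention layer 3", 8, 11),  # a=8 > previous b=7
]

def no_double_calculation(ranges):
    """True if each layer's a is greater than the previous layer's b (no overlap)."""
    return all(curr[1] > prev[2] for prev, curr in zip(ranges, ranges[1:]))

print(no_double_calculation(layer_ranges))  # True
```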
  • elements in one or more windows may be used to perform self-attention calculation on the first element.
  • step S420 includes: using elements in a window (ie, the first window) to perform self-attention calculation on the first element to obtain a representation of the first element.
  • step S420 includes: performing a self-attention calculation on the first element using the elements contained in M windows to obtain a representation of the first element, where each of the M windows includes one element or multiple consecutive elements of the input sequence, and at least one element is spaced between different windows.
  • the M windows include the first window, M is greater than 1, and the value of M is preset.
  • the M windows used for self-attention calculation of the first element are shown in Figs. 8, 9, 10, 11, and 12.
  • the input sequence is composed of elements 1 to 15.
  • the M windows used to perform the self-attention calculation on the first element include window 1, which contains elements 1, 2 and 3, and window 2, which contains elements 11, 12 and 13; there are 7 elements between window 1 and window 2.
  • the self-attention calculation is performed on the elements in the sequence by using more than one window, which can ensure the dependence range of the self-attention. It can be understood that for an element, the more windows for self-attention calculation, the larger the self-attention dependence range of the element. In the embodiment of the present application, the number of windows can be set reasonably to ensure the dependence range of the self-attention.
  • the value of M is preset, which means that the value of M has nothing to do with the length of the input sequence. In other words, the value of M may not increase as the length of the input sequence increases.
  • suppose the length of the input sequence is L1 and the value of M is set to Q; suppose the length of the input sequence is L2 (L2>L1), the value of M is still set to Q; suppose the length of the input sequence is L3 (L3<L1), the value of M is still set to Q.
  • Q is equal to 2 or 3 or other integers greater than 1.
  • the number M of windows used to perform the self-attention calculation on an element is independent of the length of the input sequence; therefore, the problem in the prior art that the calculation overhead grows with the square of the input sequence length can be avoided, and the amount of self-attention calculation can be reduced compared with the prior art.
  • one or more elements are spaced between different windows in the M windows used to perform self-attention calculation on an element, which can also reduce the amount of self-attention calculation.
  • the value of M is preset, that is, the embodiment of the present application can have a certain degree of control over the amount of self-attention calculation, so that the value of M can be set to reduce the calculation amount of self-attention.
  • the value of M can be determined according to application requirements. For example, the value of M can be set reasonably according to the current computing power. In the case of strong computing power, a larger value can be set for M; in the case of weak computing power, a smaller value can be set for M.
  • the embodiments of the present application can maximize the dependence range of self-attention on the premise that the calculation amount of self-attention does not exceed the computing ability.
  • the embodiment of the present application uses elements in multiple windows for calculation.
  • the number of the multiple windows has nothing to do with the length of the sequence, and there are gaps between different windows.
  • this reduces the amount of self-attention calculation while taking the dependency range of self-attention into account as much as possible, so as to achieve a balance between the calculation amount and the dependency range of self-attention.
  • the following formula can be used to perform the self-attention calculation on the first element h_i to obtain a new representation h′_i of the first element: h′_i = Attend(h_i, S).
  • S represents the elements contained in M windows
  • Attend() represents the calculation method of self-attention.
  • the calculation method of self-attention is based on the prior art, which will not be described in detail in this article.
  • the positions of the M windows can also be determined in a variety of ways. For example, the positions of the M windows are determined according to the position of the first element, or the positions of the M windows are preset and have nothing to do with the position of the first element.
  • the input sequence consists of element 1 to element 15.
  • the first element is element 7; assuming that the dependency length between the elements in the windows used for the self-attention calculation of element 7 and element 7 is set to be greater than 3 and less than 7, the M windows used to perform the self-attention calculation on element 7 include window 1 and window 2.
  • the input sequence is the text sequence "He founded the famous special theory of relativity at the age of 26", and the first element is the element "founded".
  • assuming that the dependency length between the elements in the windows used for the self-attention calculation of the element "founded" and the element "founded" is set to be greater than 2 and less than 5, the M windows used for the self-attention calculation of the element "founded" include window 1 and window 2.
  • the corresponding M windows may be determined in different ways.
  • the M windows include a third window, and the third window contains elements of the input sequence that are located behind the first element and whose dependency length from the first element is greater than or equal to ar and less than br, where br is a positive integer and ar is a non-negative integer less than br.
  • the following formula can be used to perform the self-attention calculation on the first element h_i to obtain a new representation h′_i of the first element: h′_i = Attend(h_i, S), where S is the set of elements in the third window.
  • the input sequence is composed of elements 1 to 15.
  • the first element is the middle element in the input sequence: element 7.
  • the M windows used for the self-attention calculation of element 7 include window 1 and window 2, both located behind element 7.
  • the dependency length between the elements contained in window 1 and element 7 is greater than 2 and less than 5
  • the dependency length between the elements contained in window 2 and element 7 is greater than 6 and less than 9.
  • the M windows include a second window, and the second window contains elements of the input sequence that are located in front of the first element and whose dependency length from the first element is greater than or equal to al and less than bl, where bl is a positive integer and al is a non-negative integer less than bl.
  • the following formula can be used to perform the self-attention calculation on the first element h_i to obtain a new representation h′_i of the first element: h′_i = Attend(h_i, S), where S is the set of elements in the second window.
  • the input sequence is composed of elements 1 to 15.
  • the first element is the middle element in the input sequence: element 7,
  • the M windows used for the self-attention calculation of element 7 include window 1 and window 2, both located in front of element 7.
  • the dependency length between the elements contained in window 1 and element 7 is greater than 4 and less than 7
  • the dependency length between the elements contained in window 2 and element 7 is greater than 1 and less than 4.
  • the M windows used for the self-attention calculation of the first element may include a window located in front of the first element and a window located behind the first element.
  • the M windows include the second window and the third window .
  • the second window contains elements in the input sequence that have a dependency length greater than or equal to al and less than bl that are located in front of the first element, and bl is a positive integer, and al is a non-negative integer less than bl.
  • the third window contains elements in the input sequence that have a dependency length greater than or equal to ar and less than br, and br is a positive integer, and ar is a non-negative integer less than br.
  • the following formula can be used to perform the self-attention calculation on the first element h_i to obtain a new representation h′_i of the first element: h′_i = Attend(h_i, S), where S is the set of elements contained in the second window and the third window.
  • al and ar can be equal or unequal
  • bl and br can be equal or unequal
  • the input sequence is composed of elements 1 to 15.
  • the first element is the middle element in the input sequence: element 7,
  • the M windows used for the self-attention calculation of element 7 include window 1 located in front of element 7 and window 2 located behind element 7; the dependency length between the elements in window 1 and element 7 is greater than 3 and less than 7, and the dependency length between the elements in window 2 and element 7 is also greater than 3 and less than 7.
  • the M windows used for self-attention calculation of the first element are multiple windows located behind the first element.
  • the M windows may further include a fourth window, and the fourth window includes the first element and its neighboring elements.
  • the input sequence is composed of elements 1 to 15, and the first element is the middle element in the input sequence: element 7; the M windows used for the self-attention calculation of element 7 include not only window 1 and window 2, which do not contain element 7 or its adjacent elements, but also window 3, which contains element 7 and its adjacent elements: element 6 and element 8.
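  • As an illustrative sketch (the helper name and the 0-based indexing are assumptions, not part of the patent text), the index set S for a FIG. 12-style configuration could be built from a second (left) window, a third (right) window, and an optional fourth window containing the first element and its neighbours:

```python
def multi_window_indices(i, al, bl, ar, br, seq_len, include_fourth=False):
    """Union of the second window [al, bl) and the third window [ar, br) of dependency
    lengths around element i, plus an optional fourth window {i-1, i, i+1}."""
    left = [i - k for k in range(al, bl) if i - k >= 0]          # second window
    right = [i + k for k in range(ar, br) if i + k < seq_len]    # third window
    fourth = [j for j in (i - 1, i, i + 1) if 0 <= j < seq_len] if include_fourth else []
    return sorted(set(left + right + fourth))

# Element 7 (0-based index 6) in a length-15 sequence, both side windows at
# dependency lengths 4..6: left window -> indices 0..2, right window -> indices 10..12,
# fourth window -> indices 5..7 (elements 6, 7 and 8 in 1-based numbering).
print(multi_window_indices(6, 4, 7, 4, 7, 15))                       # [0, 1, 2, 10, 11, 12]
print(multi_window_indices(6, 4, 7, 4, 7, 15, include_fourth=True))  # [0, 1, 2, 5, 6, 7, 10, 11, 12]
```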
  • the position of multiple windows for performing self-attention calculation on the element is determined according to the position of the element, so that the dependent range of self-attention can be flexibly realized.
  • the positions of the M windows may also be preset. For example, it can be independent of the position of the first element.
  • the M windows used for self-attention calculation are all window 1 and window 2 shown in FIG. 8.
  • FIG. 8 to FIG. 12 are taken as examples to describe the M windows used for the self-attention calculation of the first element in a sequence; it should be noted that they are only examples and are not limiting. In practical applications, the value of M can be set according to application requirements to reduce the amount of self-attention calculation as much as possible, or the boundaries of each of the M windows and the spacing between different windows of the M windows can be set according to application requirements to achieve a reasonable self-attention dependency range.
  • the calculation is performed by using elements in multiple windows.
  • the number of the multiple windows is independent of the length of the sequence, and there are intervals between different windows; this can reduce the calculation amount of self-attention while taking the dependency range of self-attention into account as much as possible, so as to achieve a balance between the calculation amount of self-attention and the dependency range.
  • the dependent range of self-attention can be flexibly realized.
  • the high-level self-attention layer can skip some elements that have been modeled in the previous layer, and the amount of calculation can be reduced.
  • FIG. 13 and FIG. 14 show the situations where the self-attention mechanism provided by the embodiment of this application and the local self-attention shown in FIG. 3, respectively, are used to perform self-attention calculations on the same text sequence in a scenario with three self-attention layers.
  • the text sequence is "He founded the famous special theory of relativity at the age of 26”
  • the self-attention layer (X-2) is the previous layer of the self-attention layer (X-1).
  • the self-attention layer (X-1) is the previous layer of the self-attention layer X.
  • FIG. 13 is a schematic diagram of performing self-attention calculation on a text sequence using the self-attention mechanism provided by an embodiment of the present application.
  • on the self-attention layer (X-2), the elements "years old", "founded" and "famous" are used for the calculation; on the self-attention layer (X-1), the elements "at", "26", "de" and "special" are used for the calculation; on the self-attention layer X, the elements "he" and "relativity" are used for the calculation.
  • the self-attention layer (X-1) skips the elements used on the self-attention layer (X-2) (the elements "years old", "founded" and "famous")
  • and the self-attention layer X skips the elements used on the self-attention layer (X-1) (the elements "at", "26", "de" and "special"), which can reduce the amount of calculation.
  • Fig. 14 is a schematic diagram of performing self-attention calculation on a text sequence using the local self-attention mechanism shown in Fig. 3.
  • on the self-attention layer (X-2), the elements "26", "years old", "founded", "famous" and "de" are used for the calculation;
  • on the self-attention layer (X-1) and the self-attention layer X, the elements "26", "years old", "founded", "famous" and "de" are still used for the calculation, which leads to repeated calculations between the multiple self-attention layers.
  • the self-attention calculation of the element "founded" by the self-attention layer (X-2), the self-attention calculation of the element "26" by the self-attention layer (X-1), and the self-attention calculation of the element "he" by the self-attention layer X together only establish the relationship between the elements "he" and "de" in the sequence; in other words, through the processing of 3 self-attention layers, a dependency of length 6 is realized.
  • the self-attention mechanism provided by the embodiment of the present application can model a longer-distance dependence than the existing local self-attention mechanism.
  • the sequence processing method provided in the embodiments of the present application can be applied to a voice processing system.
  • the voice processing system is a voice recognition system.
  • the input sequence in the method 400 provided in the foregoing embodiment is a speech sequence.
  • the sequence processing method provided in the embodiment of the present application can also be applied to a natural language processing system.
  • for example, the natural language processing system is any one of the following systems: a translation system, or a natural language understanding (NLU) system based on the BERT model.
  • NLU: natural language understanding
  • in this case, the input sequence in the method 400 provided in the foregoing embodiment is a text sequence.
  • FIG. 15 is a schematic block diagram of a sequence processing apparatus 1500 provided by an embodiment of this application.
  • the device 1500 includes an input unit 1510, a processing unit 1520, and an output unit 1530.
  • the input unit 1510 is configured to receive an input sequence and input the input sequence into the processing unit 1520.
  • the input sequence includes a plurality of elements in a sequence.
  • the processing unit 1520 is configured to perform self-attention calculations on the first element in the input sequence using the elements contained in the M windows to obtain a representation of the first element, where each of the M windows contains the input sequence One element of or multiple consecutive elements, and at least one element is spaced between different windows, at least one of the M windows does not contain the first element, and M is an integer greater than or equal to 1.
  • the output unit 1530 is configured to obtain an output sequence corresponding to the input sequence based on the representation of the first element.
  • the processing unit 1520 is further configured to determine the M windows according to the position of the first element in the input sequence.
  • the M windows include a first window, and the first window contains elements of the input sequence whose dependency length from the first element is greater than or equal to a and less than b, where a is an integer greater than 1, b is an integer greater than a, and the dependency length represents the distance between the first element and the elements in the M windows.
  • the device 1500 is applied to multiple self-attention layers, and the input sequence is the sequence output by the previous self-attention layer of the current self-attention layer; the processing unit 1520 is further configured to determine the first window according to the position of the first element in the input sequence.
  • the first window contains elements of the input sequence whose dependency length from the first element is greater than or equal to a and less than b.
  • the values of a and b are set such that the self-attention calculation performed on the first element by the current self-attention layer does not repeat the self-attention calculation performed on the first element by the previous-level self-attention layer.
  • the previous self-attention layer performs self-attention calculations on the first element based on the elements contained in the fifth window
  • the fifth window contains elements of the sequence whose dependency length from the first element is greater than or equal to a1 and less than b1, where b1 is a positive integer and a1 is a non-negative integer less than b1
  • the processing unit 1520 is further configured to determine the first window according to the position of the first element in the input sequence, and the first window contains elements of the input sequence whose dependency length from the first element is greater than or equal to a and less than b, where the value of a is greater than the value of b1.
  • M is equal to 1
  • the processing unit 1520 is configured to perform self-attention calculation on the first element in the input sequence using the elements contained in the first window to obtain a representation of the first element, wherein, the first window contains one element or consecutive multiple elements in the input sequence, but does not contain the first element.
  • M is greater than 1, and the value of M is preset.
  • the M windows include the second window and/or the third window.
  • the second window contains elements in the input sequence that have a dependency length greater than or equal to al and less than bl that are located in front of the first element, and bl is a positive integer, and al is a non-negative integer less than bl.
  • the third window contains elements in the input sequence that have a dependency length greater than or equal to ar and less than br, and br is a positive integer, and ar is a non-negative integer less than br.
  • the M windows include a fourth window, and the fourth window includes the first element and its neighboring elements.
  • the input sequence is a speech sequence or a text sequence.
  • the sequence processing apparatus 1500 provided in the embodiment of the present application may also be referred to as a sequence processor.
  • the sequence processing device may also include processing modules of other neural network layers.
  • an embodiment of the present application also provides an apparatus 1600 for sequence processing.
  • the device 1600 includes a processor 1610; the processor 1610 is coupled with a memory 1620, the memory 1620 is used to store computer programs or instructions, and the processor 1610 is used to execute the computer programs or instructions stored in the memory 1620, so that the method in the above method embodiment is executed.
  • the apparatus 1600 may further include a memory 1620.
  • the device 1600 may further include a data interface 1630, and the data interface 1630 is used for data transmission with the outside world.
  • an embodiment of the present application further provides a neural network processing device 1700, including an input module 1710, a processing module 1720, an output module 1730, and a sequence processing device 1500 provided in the embodiment of the present application.
  • the input module 1710 is used to transfer the input sequence to be processed to the device 1500 for sequence processing.
  • the input module 1710 may further include a feature extraction unit for extracting feature data from the data to be processed, and the feature data is used as the input of the device 1500 for sequence processing.
  • the sequence processing device 1500 is used to perform self-attention calculation on an input sequence to obtain an output sequence corresponding to the input sequence.
  • the processing module 1720 is used to process the output sequence obtained by the device 1500 to obtain a sequence processing result.
  • the output module 1730 is configured to output an output signal based on the sequence processing result obtained by the processing module 1720.
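  • As a sketch of how these modules chain together, the data flow of device 1700 can be written as below; the callables are placeholders for illustration, not APIs from the source.

```python
def run_pipeline(raw_input, input_module, sequence_device, processing_module, output_module):
    """Data flow of the neural network processing device 1700 (illustrative only):
    feature extraction -> self-attention sequence processing -> task head -> output."""
    seq_in = input_module(raw_input)        # e.g. acoustic or text feature extraction
    seq_out = sequence_device(seq_in)       # windowed self-attention over the sequence
    result = processing_module(seq_out)     # e.g. speech recognition or semantic understanding
    return output_module(result)            # output signal based on the processing result
```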
  • the input module 1710 is used to transfer the speech sequence to the sequence processing device 1500; the processing module 1720 is used to perform speech recognition processing on the output sequence obtained by the sequence processing device 1500 to obtain a speech recognition result.
  • the neural network processing device 1700 may be referred to as a voice processing system.
  • the input module 1710 is used to pass the text sequence to be processed to the sequence processing apparatus 1500; the processing module 1720 is used to perform semantic understanding processing on the output sequence obtained by the sequence processing apparatus 1500 to obtain a semantic understanding result.
  • the neural network processing device 1700 may be referred to as a natural language processing system.
  • by combining other types of neural network layers with self-attention layers that apply the self-attention mechanism provided in the embodiments of the present application, an efficient sequence data processing system can be constructed.
  • FIG. 18 is a schematic block diagram of a speech recognition system 1800 to which an embodiment of the application can be applied.
  • the speech recognition system 1800 can be used for real-time speech recognition.
  • the speech recognition system 1800 includes an input module 1810, a recognizer module 1820, and an output module 1830.
  • the recognizer module 1820 is a neural network including a self-attention layer, where at least one self-attention layer included in the recognizer module 1820 adopts the self-attention mechanism provided in the embodiments of the present application, that is, the method 400 provided in the foregoing embodiments is used to process the input sequence.
  • the input module 1810 is used to receive the data to be processed, and obtain the input of the recognizer module 1820 based on the data to be processed, that is, the input sequence.
  • the input module 1810 may include an acoustic feature extraction unit.
  • the acoustic feature extraction unit is used to perform feature extraction on the input data to be processed to obtain feature data.
  • the feature data extracted by the acoustic feature extraction unit is the input of the recognizer module 1820.
  • the recognizer module 1820 is used to perform voice recognition processing on the sequence input by the input module 1810 to obtain a voice recognition result.
  • the recognizer module 1820 includes a self-attention module 1821 and other neural network modules 1822.
  • the self-attention module 1821 includes the following structures: a batch normalization layer, a self-attention layer, a residual connection (residual), and an FFN layer. At least one self-attention layer included in the self-attention module 1821 adopts the self-attention mechanism provided in the embodiment of the present application, that is, the method 400 provided in the above embodiment is used to process the input sequence.
  • Residual connection is a neural network connection method, which generally refers to adding the output of the current layer and the output of a previous layer as the output.
  • Batch normalization is a method of normalizing the intermediate values of a neural network.
  • the FFN layer is, for example, Position-wise FFN. Position-wise FFN means that the same FFN is used for each position in the sequence.
  • the FFN has two layers.
  • the activation function of the first layer is ReLU, and the second layer has no activation function.
  • ReLU is an activation function of a neural network.
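  • A minimal sketch of the position-wise FFN and residual connection described above, assuming plain NumPy arrays and leaving out normalization parameters; the names are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)               # ReLU: y = max(x, 0)

def position_wise_ffn(X, W1, b1, W2, b2):
    """Two-layer FFN applied identically at every position of X (shape [L, d]);
    the first layer uses ReLU and the second layer has no activation."""
    return relu(X @ W1 + b1) @ W2 + b2

def residual(sublayer_output, sublayer_input):
    """Residual connection: add the sub-layer output to its input."""
    return sublayer_output + sublayer_input
```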
  • the self-attention module 1821 may be stacked N times.
  • Other neural network modules 1822 may include a convolution block (Convolution block).
  • the convolution module can be repeatedly stacked M times.
  • the other neural network module 1822 may be ConvBlock.
  • ConvBlock refers to the structure where the convolution layer is followed by the batch normalization layer and then the ReLU.
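  • The ConvBlock structure (convolution, then batch normalization, then ReLU) might look like the following NumPy sketch, using a naive 1-D convolution and normalization statistics taken over the sequence purely for simplicity; this is an assumption-laden illustration, not the reference implementation.

```python
import numpy as np

def conv1d(X, W, b):
    """Naive 1-D convolution over a [L, C_in] sequence with kernel W of shape
    [k, C_in, C_out] and 'same' zero padding."""
    L, C_in = X.shape
    k, _, C_out = W.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.zeros((L, C_out))
    for t in range(L):
        out[t] = np.einsum('kc,kco->o', Xp[t:t + k], W) + b
    return out

def batch_norm(X, eps=1e-5):
    """Normalize each channel to zero mean and unit variance."""
    return (X - X.mean(axis=0, keepdims=True)) / np.sqrt(X.var(axis=0, keepdims=True) + eps)

def conv_block(X, W, b):
    """ConvBlock: convolution layer -> batch normalization layer -> ReLU."""
    return np.maximum(batch_norm(conv1d(X, W, b)), 0.0)
```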
  • the recognizer module 1820 can also be stacked K times.
  • Figure 18 legend: Convolution, batch normalization, FFN (feedforward network), ReLU.
  • the output module 1830 is configured to output an output signal based on the speech recognition result obtained by the recognizer module 1820.
  • the output signal is a sequence of characters.
  • the output module 1830 includes the following structures: a layer normalization (layer norm) layer and an output feedforward neural network (output FFN).
  • Feedforward neural network is a type of neural network.
  • the speech recognition system 1800 provided in the embodiments of the present application applies the self-attention mechanism provided in the embodiments of the present application, so that the amount of self-attention calculation can be reduced while the dependency range of self-attention is ensured, thereby enabling efficient processing of sequence data.
  • the embodiments of the present application also provide a computer-readable medium that stores program code for execution by a device, and the program code includes instructions for performing the method of the foregoing embodiments.
  • the embodiments of the present application also provide a computer program product containing instructions, which when the computer program product runs on a computer, cause the computer to execute the method of the foregoing embodiment.
  • An embodiment of the present application also provides a chip, which includes a processor and a data interface, and the processor reads instructions stored on the memory through the data interface, and executes the method of the foregoing embodiment.
  • the chip may further include a memory in which instructions are stored, and the processor is used to execute the instructions stored in the memory, and when the instructions are executed, the processor is used to execute the method in the foregoing embodiment.
  • FIG. 19 is a chip hardware structure provided by an embodiment of the application, and the chip includes a neural network processor 1900.
  • the chip can be installed in any one or more of the following devices: the apparatus 1500 shown in FIG. 15, the apparatus 1600 shown in FIG. 16, the apparatus 1700 shown in FIG. 17, or the apparatus 1800 shown in FIG. 18.
  • the method 400 in the above method embodiment can be implemented in a chip as shown in FIG. 19.
  • the neural network processor 1900 is mounted on a host CPU as a coprocessor, and the host CPU distributes tasks.
  • the core part of the neural network processor 1900 is the arithmetic circuit 1903.
  • the controller 1904 controls the arithmetic circuit 1903 to obtain data in the memory (weight memory 1902 or input memory 1901) and perform calculations.
  • the arithmetic circuit 1903 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 1903 is a two-dimensional systolic array. The arithmetic circuit 1903 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1903 is a general-purpose matrix processor.
  • the arithmetic circuit 1903 fetches the data corresponding to matrix B from the weight memory 1902 and caches it on each PE in the arithmetic circuit 1903.
  • the arithmetic circuit 1903 fetches the data of matrix A from the input memory 1901 and performs matrix operations with matrix B, and the partial results or final result of the matrix are stored in an accumulator 1908.
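  • The accumulation described here (matrix B streamed from the weight memory, matrix A from the input memory, partial results summed in the accumulator) can be sketched as follows; the tiling strategy is an assumption made only for illustration.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Compute C = A @ B tile by tile, adding each partial product into an
    accumulator the way the arithmetic circuit accumulates partial results."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))                     # accumulator contents
    for k0 in range(0, K, tile):
        A_tile = A[:, k0:k0 + tile]          # slice of matrix A from the input memory
        B_tile = B[k0:k0 + tile, :]          # slice of matrix B from the weight memory
        C += A_tile @ B_tile                 # partial result added to the accumulator
    return C
```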
  • the vector calculation unit 1907 can perform further processing on the output of the arithmetic circuit 1903, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 1907 can be used for network calculations in the non-convolutional/non-FC layers of the neural network, such as pooling, batch normalization, local response normalization, etc.
  • the vector calculation unit 1907 can store the processed output vector in a unified memory (also referred to as a unified buffer) 1906.
  • the vector calculation unit 1907 may apply a non-linear function to the output of the arithmetic circuit 1903, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 1907 generates normalized values, combined values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1903, for example for use in subsequent layers in a neural network.
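  • A small sketch of the vector calculation unit's post-processing (applying a non-linear function to the accumulated output and normalizing it before it feeds a subsequent layer); the choice of ReLU and of this simple normalization is an assumption.

```python
import numpy as np

def vector_unit_postprocess(accumulator_output, eps=1e-5):
    """Apply an activation to the accumulated matrix result and normalize it,
    mimicking the vector calculation unit's non-convolutional processing."""
    activated = np.maximum(accumulator_output, 0.0)          # activation values (ReLU assumed)
    mean = activated.mean(axis=-1, keepdims=True)
    std = activated.std(axis=-1, keepdims=True)
    return (activated - mean) / (std + eps)                  # normalized values
```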
  • the method 400 in the above method embodiment may be executed by the arithmetic circuit 1903 or the vector calculation unit 1907.
  • the unified memory 1906 is used to store input data and output data.
  • through the direct memory access controller (DMAC) 1905, the input data in the external memory can be transferred to the input memory 1901 and/or the unified memory 1906, the weight data in the external memory can be stored in the weight memory 1902, and the data in the unified memory 1906 can be stored back to the external memory.
  • the bus interface unit (BIU) 1910 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 1909 through the bus.
  • An instruction fetch buffer 1909 connected to the controller 1904 is used to store instructions used by the controller 1904;
  • the controller 1904 is used to call the instructions cached in the memory 1909 to control the working process of the computing accelerator.
  • the unified memory 1906, the input memory 1901, the weight memory 1902, and the instruction fetch memory 1909 are all on-chip memories.
  • the external memory is a memory external to the NPU.
  • the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or other readable and writable memory.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a Universal Serial Bus flash disk (UFD, also referred to as a U disk or USB flash drive), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A sequence processing method and apparatus, relating to the field of artificial intelligence and in particular to the field of sequence data processing. The method includes: receiving an input sequence (S410); for a first element in the input sequence, performing self-attention calculation using the elements contained in M windows to obtain a representation of the first element, where each window contains one element or multiple consecutive elements of the input sequence, different windows are separated by at least one element, at least one of the M windows does not contain the first element, and M is an integer greater than or equal to 1 (S420); and obtaining, based on the representation of the first element, an output sequence corresponding to the input sequence (S430). For an element in the sequence, performing self-attention calculation using the elements in one or more windows rather than all elements in the sequence reduces the amount of self-attention computation; at least one of the windows can skip the first element, and the position of that window is not fixed, which reduces the restriction on the dependency range of self-attention.

Description

序列处理的方法与装置
本申请要求于2020年05月26日提交中国国家知识产权局、申请号为202010454695.6、申请名称为“序列处理的方法与装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,具体涉及一种序列处理的方法与装置。
背景技术
语音处理和自然语言处理(natural language processing,NLP)的很多问题,都可以看成序列处理问题。序列由若干存在先后顺序的元素组成。例如,语音数据可以表示成采样点为元素的序列。又例如,文本数据可以表示成以词为元素的序列。序列中的某个元素,其代表的意义往往与序列中其他元素存在关系。如何建模序列中元素之间的关系,是序列处理问题的关键。当前,建模序列中元素之间的关系的方法有循环神经网络(recurrent neural network,RNN)、卷积神经网络(convolutional neural networks,CNN)与自注意力(self-attention)。其中,自注意力是一种通过建立序列中某个元素与序列中其它元素的关系来得到这个元素的表示的方法。
传统的自注意力的方法是,针对一个元素,建立这个元素与序列中所有元素的关系,这导致自注意力的计算量很大。为了减小自注意力的计算量,当前技术提出的解决方案为,针对一个元素,固定使用该元素附近的几个元素进行自注意力计算,但是,该方案会产生自注意力的依赖范围受到限制的问题。
如何平衡自注意力的计算量与依赖范围,是需要解决的问题。
发明内容
本申请提供一种序列处理的方法与装置,可以较好地平衡自注意力的计算量与依赖范围。
第一方面,提供了一种序列处理的方法,所述方法包括:接收输入序列,所述输入序列包括多个具有先后顺序的元素;对所述输入序列中的第一元素,使用M个窗口内包含的元素进行自注意力计算,获得所述第一元素的表示,其中,所述M个窗口中每个窗口内包含所述输入序列中的一个元素或连续的多个元素,且不同窗口之间至少间隔一个元素,所述M个窗口中至少一个窗口内不包含所述第一元素,M为大于或等于1的整数;基于所述第一元素的表示,获得所述输入序列对应的输出序列。
M等于1时,表示,可以使用一个窗口(记为第一窗口)内的元素对第一元素进行自注意力计算,第一窗口包含所述输入序列中除所述第一元素之外的一个元素或多个连续的元素,换句话说,第一窗口跳过第一元素,包含输入序列中其他的一个元素或多个连续的 元素。可选地,第一窗口内也不包括第一元素的相邻元素。
第一窗口的位置是可以灵活配置的,而不是固定的。只要跳过第一元素(或者,还有其相邻元素),第一窗口可以位于输入序列上的任何位置。
第一窗口的大小,即第一窗口内包含的元素的数量也是可以配置的,不是固定的。
在本申请中,对于序列中的第一元素,可以使用第一窗口内的元素而非序列中的所有元素进行自注意力计算,这可以减小自注意力的计算量。
现有技术在获取序列中某个元素的表示时,固定选取该元素附近的几个元素进行自注意力计算,这导致自注意力的依赖范围受到限制。在本申请中,基于第一窗口内的元素对序列中的第一元素进行自注意力计算,因为该第一窗口可以跳过第一元素及其相邻元素,且该第一窗口的位置可以不固定,因此,相对于现有技术可以减小对自注意力的依赖范围的限制。
因此,本申请实施例可以较好地平衡自注意力的计算量与依赖范围。
结合第一方面,在第一方面的一种可能的实现方式中,所述方法还包括:根据所述第一元素在所述输入序列中的位置,确定所述M个窗口,所述M个窗口中包括第一窗口,所述第一窗口包含所述输入序列中与所述第一元素的依赖长度大于或等于a,且小于b的元素,其中,a为大于1的整数,b为大于a的整数,所述依赖长度表示所述第一元素与所述M个窗口内的元素之间的距离。
实际应用中,可以根据应用需求灵活配置a与b的取值,以合理确定第一窗口的位置,从而选择合理的自注意力依赖范围。
自注意力的依赖范围表示,针对一个元素,与它建立关系(即进行自注意力计算)的其他元素与该元素之间的依赖长度的范围。该依赖长度表示该元素与其他元素之间的距离。
可选地,在本实现方式中,所述方法应用于多个自注意力层,所述输入序列是当前自注意力层的前一级自注意力层输出的序列;其中,b与a的取值被设置为,使得当前自注意力层对所述第一元素的自注意力计算与所述前一级自注意力层对所述第一元素的自注意力计算没有重复计算。
假设所述前一级自注意力层基于第五窗口内包含的元素对所述第一元素进行自注意力计算,所述第五窗口包含所述序列中与所述第一元素的依赖长度大于或等于a1,且小于b1的元素,b1为正整数,a1为小于b1的非负整数;其中,a的取值大于b1的取值。
在本申请中,通过根据序列中第一元素的位置确定用于对第一元素进行自注意力计算的窗口的位置,使得可以灵活地选择第一元素的自注意力依赖范围,因此可以进一步地减小对自注意力的依赖范围的限制。
结合第一方面,在第一方面的一种可能的实现方式中,第一窗口的位置可以预设。
结合第一方面,在第一方面的一种可能的实现方式中,M大于1,且M的取值是预设的。
M大于1,表示,可以使用多个窗口内的元素对第一元素进行自注意力计算。
M的取值是预设的,表示,M的取值与输入序列的长度无关。也就是说,M的取值可以不随输入序列长度的增大而增大。
在本申请中,通过使用大于1个的窗口对序列中的元素进行自注意力计算,这可以保 证自注意力的依赖范围。可以理解到,针对一个元素,进行自注意力计算的窗口越多,该元素的自注意力依赖范围越大。本申请实施例可以通过合理设置窗口的数量,来保证自注意力的依赖范围。
此外,对一个元素进行自注意力计算的窗口的个数M与输入序列的长度无关,因此,可以避免现有技术中存在的计算开销随输入序列的长度呈平方增长的问题,因此,相对于现有技术可以减小自注意力的计算量。此外,M个窗口中不同窗口之间间隔一个或多个元素,这也可以减小自注意的计算量。
结合第一方面,在第一方面的一种可能的实现方式中,所述M个窗口中包括第二窗口,和/或第三窗口。
第二窗口,所述第二窗口包含所述输入序列中位于所述第一元素前面的与所述第一元素的依赖长度大于或等于al,且小于bl的元素,bl为正整数,al为小于bl的非负整数。
第三窗口,所述第三窗口包含所述输入序列中位于所述第一元素后面的与所述第一元素的依赖长度大于或等于ar,且小于br的元素,br为正整数,ar为小于br的非负整数。
在所述M个窗口中包括第二窗口和第三窗口的情况下,al与ar可以相等或不相等,bl与br可以相等或不相等。
结合第一方面,在第一方面的一种可能的实现方式中,所述M个窗口中包括第四窗口,所述第四窗口包含所述第一元素及其相邻元素。
结合第一方面,在第一方面的一种可能的实现方式中,所述输入序列为语音序列或文本序列。
第二方面,提供一种序列处理的装置,所述装置包括接收单元、处理单元与输出单元。
所述接收单元,用于接收输入序列,所述输入序列包括多个具有先后顺序的元素。所述处理单元,用于对所述输入序列中的第一元素,使用M个窗口内包含的元素进行自注意力计算,获得所述第一元素的表示,其中,所述M个窗口中每个窗口内包含所述输入序列中的一个元素或连续的多个元素,且不同窗口之间至少间隔一个元素,所述M个窗口中至少一个窗口内不包含所述第一元素,M为大于或等于1的整数。所述输出单元,用于基于所述第一元素的表示,获得所述输入序列对应的输出序列。
结合第二方面,在第二方面的一种可能的实现方式中,所述处理单元还用于,根据所述第一元素在所述输入序列中的位置,确定所述M个窗口,所述M个窗口中包括第一窗口,所述第一窗口包含所述输入序列中与所述第一元素的依赖长度大于或等于a,且小于b的元素,其中,a为大于1的整数,b为大于a的整数,所述依赖长度表示所述第一元素与所述M个窗口内的元素之间的距离。
结合第二方面,在第二方面的一种可能的实现方式中,所述装置应用于多个自注意力层,所述输入序列是当前自注意力层的前一级自注意力层输出的序列;其中,b与a的取值被设置为,使得当前自注意力层对所述第一元素的自注意力计算与所述前一级自注意力层对所述第一元素的自注意力计算没有重复计算。
结合第二方面,在第二方面的一种可能的实现方式中,所述前一级自注意力层基于第五窗口内包含的元素对所述第一元素进行自注意力计算,所述第五窗口包含所述序列中与所述第一元素的依赖长度大于或等于a1,且小于b1的元素,b1为正整数,a1为小于b1的非负整数;其中,a的取值大于b1的取值。
结合第二方面,在第二方面的一种可能的实现方式中,M大于1,且M的取值是预设的。
结合第二方面,在第二方面的一种可能的实现方式中,所述M个窗口中包括第二窗口,和/或第三窗口。第二窗口与第三窗口的描述详见前文,这里不再赘述。
结合第二方面,在第二方面的一种可能的实现方式中,所述M个窗口中包括第四窗口,所述第四窗口包含所述第一元素及其相邻元素。
结合第二方面,在第二方面的一种可能的实现方式中,所述输入序列为语音序列或文本序列。
第三方面,提供一种神经网络处理装置,包括输入模块、处理模块、输出模块以及如权利要求9-16中任一项所述的序列处理的装置。所述输入模块用于,将输入序列输入所述序列处理的装置;所述序列处理的装置用于,对所述输入序列进行自注意力计算,获得所述输入序列对应的输出序列;所述处理模块用于,对所述输出序列进行处理,获得序列处理结果;所述输出模块,用于基于所述处理模块获得的序列处理结果输出输出信号。其中,在所述输入序列为语音序列的情况下,所述处理模块用于对所述输出序列进行语音识别处理,获得语音识别结果;或在所述输入序列为文本序列的情况下,所述处理模块用于对所述输出序列进行语义理解处理,获得语义理解结果。
第四方面,提供一种数据处理的装置,该装置包括:存储器,用于存储程序;处理器,用于执行存储器存储的程序,当存储器存储的程序被执行时,处理器用于执行上述第一方面中的方法。
第五方面,提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行上述第一方面中的方法。
第六方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面中的方法。
第七方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行上述第一方面中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行上述第一方面中的方法。
基于上述描述,在本申请提供的方案中,可以基于第一窗口内的元素对序列中的第一元素进行自注意力计算,因为该第一窗口可以跳过第一元素及其相邻元素,且该第一窗口的位置可以不固定,因此,相对于现有技术可以减小对自注意力的依赖范围的限制。
此外,本申请在获得序列中的一个元素的表示时,可以使用多个窗口内的元素进行自注意力计算,该多个窗口的个数与序列的长度无关,且不同窗口之间具有间隔,可以在减小自注意力的计算量的同时,尽量兼顾自注意力的依赖范围,从而可以实现自注意力的计算量与依赖范围的平衡。
附图说明
图1为自注意力机制的示意图。
图2为包含自注意力层的神经网络的架构示意图。
图3为局部自注意力机制的示意图。
图4为本申请实施例提供的序列处理的方法的示意性流程图。
图5为本申请实施例中用于对序列中第一元素进行自注意力计算的窗口的示意图。
图6为本申请实施例提供的序列处理的方法的另一示意性流程图。
图7为本申请实施例应用于多个自注意力层场景中时对序列中第一元素进行自注意力计算的窗口的示意图。
图8至图12为本申请实施例中用于对序列中第一元素进行自注意力计算的M个窗口的示意图。
图13为本申请实施例应用于多个自注意力层场景中时对序列中元素进行自注意力计算的示意图。
图14为在多个自注意力层场景中采用局部自注意力机制的示意图。
图15为本申请实施例提供的序列处理的装置的示意性框图。
图16为本申请实施例提供的序列处理的装置的另一示意性框图。
图17为本申请实施例提供的神经网络处理装置的示意性框图。
图18为本申请实施例提供的语音识别***的示意性框图。
图19本申请实施例提供的一种芯片硬件结构示意图。
具体实施方式
语音处理和自然语言处理(natural language processing,NLP)的很多问题,都可以看成序列数据(sequence data)的处理问题,可以简称为序列处理问题。
例如,在自然语音处理的***中,输入的一句话可以表示成一个词序列。如图1所示,“他在26岁创立著名的狭义相对论”这句话被表示成一个词序列。词序列也可称为文本序列。又例如,在语音识别***中,一段连续的语音被分割成时间相等的帧,即可表示成一个帧序列。帧序列也可称为语音序列。
序列由若干元素组成,且元素之间存在先后顺序。例如,语音数据可以表示成以采样点为元素的序列。又例如,文本数据可以表示成以词为元素的序列。例如,在图1的例子中,“他”、“在”、“26”、“岁”、“创立”、“著名”、“的”、“狭义”与“相对论”分别是文本序列“他在26岁创立著名的狭义相对论”中的元素。
序列中的某个元素,其代表的意义往往与序列中其他元素存在关系。例如,在图1的例子中,元素“他”与元素“创立”在语法上是主谓关系。如何建模序列中元素之间的关系,是序列处理问题的关键。
当前技术中,建模序列中元素之间的关系的方法有循环神经网络(recurrent neural network,RNN)、卷积神经网络(convolutional neural networks,CNN)与自注意力(self-attention)。其中,自注意力是一种通过建立序列中某个元素与序列中其它元素的关系来获得这个元素的表示的方法。或者说,自注意力是一种用来建模序列中元素间的关系进而得到更好的元素表示的方法。针对一个元素,相对于该元素在进行自注意力计算之前的表示,通过自注意力计算之后得到表示可以称为该元素的新的表示。
自注意力可以作为神经网络中的一个层。包括自注意力层的神经网络也可以称为输入序列处理器。图2示出输入序列处理器的示意性框图。输入序列处理器为包括自注意力层 的神经网络,该神经网络中还可以包括其他神经网络层。
例如,待处理的序列被输入序列处理器,自注意力层对该序列进行自注意力操作,获得该序列中各个元素的新的表示,从而获得新的序列,该新的序列被输入其他神经网络层进行处理,最终获得序列处理结果,即序列处理器输出序列处理结果。例如,待处理的序列为文本序列,序列处理器输出的序列处理结果可以是语义理解结果或机器翻译结果等文本处理结果。再例如,待处理的序列为语音序列,序列处理器输出的序列处理结果可以是语音识别结果等语音处理结果。
需要说明的是,图2仅为示例而非限定。例如,待处理的序列可以经过特征提取模块的处理后再被输入序列处理器。又例如,序列处理器中可以包括一个或多个自注意力层,在包括多个自注意力层的场景中,在两个自注意力层之间可以包含其他神经网络层。包括自注意力层的神经网络的架构设计为现有技术,本文不作详述。
在传统的自注意力方法中,针对一个元素,建立这个元素与序列中所有元素的关系,也就是说,针对一个元素,使用序列中所有元素来进行自注意力计算。
如图1所示,对于文本序列“他在26岁创立著名的狭义相对论”,在计算序列中的元素“创立”的表示时,会选择序列中所有元素进行自注意力计算。例如,计算元素“创立”与序列中其他所有元素的分数,这个分数代表“创立”跟其他元素是否存在某种关系,分数越高代表存在这种关系的可能性越高。
传统自注意力在数学上的描述如下。
假设一个序列表示为H={h_1, h_2, …, h_i, …, h_L}，h_i表示序列H中的元素。例如，每个元素h_i使用宽度为d的向量表示。使用自注意力建模元素h_i与序列中的其他元素的关系，并得到这个元素h_i的新的表示h′_i的过程可表示如下。
h′_i = Attend(h_i, S)
其中S=H,Attend()表示自注意力的计算方式。
自注意力的计算方式包括多种方式。例如,一种自注意力的计算方式如下所示。
Attend(h_i, S) = Softmax(Q(h_i)·K(S)^T / √d)·V(S)
其中,Q()、K()与V()分别通常为一个线性映射。d表示用于表示元素的向量的宽度,即序列中每个元素分别使用宽度为d的向量表示。Softmax()表示归一化指数函数。自注意力的计算方式为现有技术,本文不作详述。
应理解，采用上述公式所示的方式对一个序列进行自注意力计算，其中，单个元素的自注意力的计算量为O(Ld)，整个序列的计算量为O(L²d)。可知，采用上述公式所示的方式对一个序列进行自注意力计算，计算开销会随着输入的序列的长度呈平方增长，则在处理长序列时，往往存在计算量过大的问题。
为了减小自注意力的计算量,当前技术中提出局部自注意力(也称为截断自注意力)的方案。在局部自注意力的方案中,在计算序列中的某个元素的表示时,仅选取该元素附近的几个元素而非序列中所有元素进行自注意力计算。如图3所示,文本序列为“他在26岁创立著名的狭义相对论”,在计算元素“创立”的表示时,仅选取元素“创立”附近的元素“26”、“岁”、“创立”、“著名”、“的”进行自注意力计算。
但是,局部自注意力的方案会产生自注意力的依赖范围受到限制的问题。
自注意力的依赖范围表示,针对一个元素,与它建立关系(即进行自注意力计算)的其他元素与该元素之间的依赖长度的范围。该依赖长度表示该元素与其他元素之间的距离。例如,在图3的例子中,针对元素“创立”,假设将其与自身即“创立”之间的依赖长度记为0,则元素“创立”与元素“岁”之间的依赖长度为1(同理,与元素“著名”之间的依赖长度也为1),元素“创立”与元素“26”之间的依赖长度为2(同理,与元素“的”之间的依赖长度也为2)。即,在图3的例子中,对元素“创立”进行自注意力计算时的依赖范围是0~2。
上述可知,现有技术无法平衡自注意力的依赖范围与计算量。
针对上述问题,本申请提出一种序列处理的方法与装置,可以较好地实现自注意力的计算量与依赖范围的平衡。
图4为本申请实施例提供的序列处理的方法400的示意性流程图。方法400包括步骤S410、步骤S430与步骤S430。
S410,接收输入序列,输入序列包括多个具有先后顺序的元素。
该输入序列表示待进行自注意力处理的序列。
作为一个示例,采用图2中所示的自注意力层执行本方法400,该输入序列可以为该自注意力层的前一个神经网络层输出的序列。
例如,该输入序列可以是语音序列。例如,在语音识别***中,一段连续的语音被分割成时间相等的帧,所形成的帧序列可以称为语音序列。例如,语音序列为元素为采样点的序列。
又例如,该输入序列可以是文本序列。例如,在自然语音处理的***中,输入的一句话可以表示成一个词序列。如图1所示,“他在26岁创立著名的狭义相对论”这句话被表示成一个词序列。词序列也可称为文本序列。文本序列为元素为词的序列。
S420,对输入序列中的第一元素,使用M个窗口内包含的元素进行自注意力计算,获得第一元素的表示,其中,M个窗口中每个窗口内包含输入序列中的一个元素或连续的多个元素,且不同窗口之间至少间隔一个元素,M个窗口中至少一个窗口内不包含第一元素,M为大于或等于1的整数。
第一元素表示输入序列中任一个元素。如前文描述,对一个序列的自注意力处理包括对该序列中每个元素的自注意力计算。在本申请实施例中,自然也是要对输入序列中每个元素进行自注意力计算,从而获得对应元素的表示。考虑到序列中每个元素的自注意力计算方式是类似的,也为了便于理解与描述,本申请实施例中以第一元素为例进行描述。第一元素表示输入序列中的任一个元素。换句话说,对于输入序列中任一个元素,均采用步骤S420的方式对该元素进行自注意力计算,以获得该元素的表示。
M等于1时,表示,可以使用一个窗口(记为第一窗口)内的元素对第一元素进行自注意力计算,第一窗口包含所述输入序列中除所述第一元素之外的一个元素或多个连续的元素,换句话说,第一窗口跳过第一元素,包含输入序列中其他的一个元素或多个连续的元素。
可选地,第一窗口内也不包括第一元素的相邻元素。
第一元素的相邻元素包括与第一元素相邻的元素。
例如,在图5的示例中,第一元素为元素7,第一元素的相邻元素包括前面相邻的元 素6与右边相邻的元素8。
M大于1时,表示,可以使用多个窗口内的元素对第一元素进行自注意力计算。下文将描述M大于1时的情形。
下面先以在步骤S420中使用第一窗口内的元素对第一元素进行自注意力计算为例进行描述。
第一窗口的位置是可以灵活配置的,而不是固定的。只要跳过第一元素(或者,还有其相邻元素),第一窗口可以位于输入序列上的任何位置。
例如,第一窗口位于第一元素的前面。又例如,第一窗口位于第一元素的后面。
在第一元素为输入序列中的首个元素的情况下,第一窗口位于第一元素的后面;在第一元素为输入序列中的最后一个元素的情况下,第一窗口位于第一元素的前面;在第一元素为输入序列中的中间元素的情况下,第一窗口可以位于第一元素的前面或后面。
实际应用中,可以根据应用需求合理确定第一窗口的位置。
第一窗口的大小,即第一窗口内包含的元素的数量也是可以配置的,不是固定的。
例如,第一窗口包含1个、2个、3个或更多数量的元素。
应用中,可以根据应用需求合理配置第一窗口的大小。
作为一个示例,如图5所示,输入序列由元素1至元素15组成,第一元素为元素7,第一窗口可以为图5中所示的窗口1、窗口2与窗口3中的任一个窗口。
例如，可以使用如下公式对第一元素h_i进行自注意力计算，获得第一元素的新的表示h′_i：
h′_i = Attend(h_i, S)
其中,S表示用于对第一元素进行自注意力计算的元素的集合,S中包括第一窗口内的元素。Attend()表示自注意力的计算方式。自注意力的计算方式为现有技术,本文不作详述。
S430,基于第一元素的表示,获得输入序列对应的输出序列。
应理解,在步骤S430中,基于输入序列中每个元素的表示,获得该输出序列。其中,对于输入序列中每个元素,均通过步骤S420的方式获取对应元素的表示。
作为一个示例,采用图2中所示的自注意力层执行本方法400,步骤S430获得的输出序列可以被传递到该自注意力层的下一个神经网络层进行后续处理。
在本申请实施例中,对于序列中的第一元素,基于第一窗口内的元素而非序列中的所有元素进行自注意力计算,这可以减小自注意力的计算量。
此外,如前文描述,在如图3所示的现有技术中,在获取序列中某个元素的表示时,固定选取该元素附近的几个元素进行自注意力计算,这导致自注意力的依赖范围受到限制。
在本申请实施例中,基于第一窗口内的元素对序列中的第一元素进行自注意力计算,因为该第一窗口可以跳过第一元素及其相邻元素,且该第一窗口的位置可以不固定,因此,相对于现有技术可以减小对自注意力的依赖范围的限制。
因此,本申请实施例可以较好地平衡自注意力的计算量与依赖范围。
本申请实施例提供的自注意力机制中,用来进行自注意力计算的窗口不是固定的,可 以动态变化,因此,本申请实施例提供的自注意力机制可以称为跳跃自注意力。
第一窗口在输入序列上的位置可以通过多种方式确定。
第一种方式,第一窗口的位置是根据第一元素的位置确定的。
例如,设置第一窗口包含输入序列中与第一元素的依赖长度大于或等于a,且小于b的元素,该依赖长度表示第一元素与第一窗口内的元素之间的距离,其中,a为大于1的整数,b为大于a的整数。应理解,b的取值小于输入序列的长度。
可选地,如图6所示,在图4所示实施例中,方法400还可以包括步骤S440。
S440,根据第一元素在输入序列中的位置,确定第一窗口,第一窗口包含输入序列中与第一元素的依赖长度大于或等于a,且小于b的元素,其中,a为大于1的整数,b为大于a的整数。在步骤S420中,使用第一窗口内的元素对第一元素进行自注意力计算,获得第一元素的表示。
以仅使用第一窗口内的元素获取第一元素的新的表示为例，可以使用如下公式对第一元素h_i进行自注意力计算，获得第一元素的新的表示h′_i：
h′_i = Attend(h_i, S)
其中，S={h_j | i-b ≤ j ≤ i-a}，Attend()表示自注意力的计算方式。
应理解,通过设置a与b的取值,可以灵活选择第一元素的依赖范围。
继续参见图5,若设置第一窗口包含输入序列中与第一元素(即图5中的元素7)的依赖长度大于1且小于4的元素,则第一窗口可以为窗口1。若设置第一窗口包含输入序列中与第一元素的依赖长度大于1且小于5的元素,则第一窗口还是窗口1。若设置第一窗口包含输入序列中与第一元素的依赖长度大于2且小于6(或7,或8)的元素,则第一窗口为窗口2。若设置第一窗口包含输入序列中与第一元素的依赖长度大于6且小于9的元素,则第一窗口为窗口3。
上述参见图5的描述仅为示例而非限定,实际应用中,可以根据应用需求灵活配置a与b的取值,以合理确定第一窗口的位置,从而选择合理的自注意力依赖范围。
在本申请实施例中,通过根据序列中第一元素的位置确定用于对第一元素进行自注意力计算的窗口的位置,使得可以灵活地选择第一元素的自注意力依赖范围,因此可以进一步地减小对自注意力的依赖范围的限制。
第二种方式,第一窗口的位置是预设的。例如,第一窗口的位置与第一元素的位置无关。例如,继续参见图5,可以设置对元素7与元素8进行自注意力计算时,均使用窗口2。
继续参见图2,在包括自注意力层的神经网络中,通常包括多个自注意力层,如图2所示的“×N”,表示神经网络中可以包括N个图2中虚线所示的层组合,即包括多个自注意力层。
本申请实施例提供的序列处理的方法,不仅可以应用于单个自注意力层上,还可应用于多个自主注意力层上。其中,通过合理设置相邻两层上的窗口的位置,可以进一步减小自注意力的计算量。下文将描述。
可选地,在图4所示实施例中,方法400应用于多个自注意力层,输入序列是当前自注意力层的前一级自注意力层输出的序列,第一窗口的位置是根据第一元素的位置确定的,第一窗口包含输入序列中与第一元素的依赖长度大于或等于a,且小于b的元素,其 中,b与a的取值被设置为,使得当前自注意力层对第一元素的自注意力计算与前一级自注意力层对第一元素的自注意力计算没有重复计算。
为了便于理解与描述,将当前自注意力层记为自注意力层X,将自注意力层X的前一级自注意力层记为自注意力层(X-1)。假设自注意力层(X-1)在对第一元素进行自注意力计算时,已建立了第一元素与元素1之间的关系,则在方法400中,设置b与a的取值,可以跳过元素1,使用其他元素对第一元素进行自注意力计算。
例如,自注意力层(X-1)基于第五窗口内包含的元素对第一元素进行自注意力计算,第五窗口包含序列中与第一元素的依赖长度大于或等于a1,且小于b1的元素,b1为正整数,a1为小于b1的非负整数,则在方法400中,a的取值大于b1的取值。
作为一个示例,如图7所示,输入序列由元素1至元素15组成,第一元素为元素7。自注意力层(X-1)为自注意力层X的前一级自注意力层,自注意力层X的输入序列是基于自注意力层(X-1)的输出序列得到的。自注意力层(X-1)使用窗口1内的元素对第一元素进行自注意力计算,则自注意力层X在对第一元素进行自注意力计算时可以跳过元素6、7、8,例如,可以使用窗口2、窗口3或窗口4内的元素进行计算,这样可以避免重复计算。
需要说明的是,图7仅为示例而非限定。在实际应用中,可以根据具体需求,协调设置上下相邻两个自注意力层上的窗口,以减小自注意力的计算量。
在图7的示例中,在自注意力层(X-1)与自注意力层X之间没有其他神经网络层的情况下,自注意力层X的输入序列直接就是自注意力层(X-1)的输出序列。例如,在自注意力层(X-1)与自注意力层X之间具有其他神经网络层的情况下,自注意力层X的输入序列是自注意力层(X-1)的输出序列经过其他神经网络层处理后输出的序列。
作为另一个示例,在具有三个自注意力层的场景中,假设在每个自注意力层上,针对序列中的第一元素,使用包含序列中与第一元素的依赖长度大于a且小于b的元素的窗口内的元素进行自注意力计算。假设自注意力层1是自注意力层2的前一级,自注意力层2是自注意力层3的前一级。3个自注意力层上a与b的定义如表1所示。
表1
自注意力层 a b
1 0 5
2 5 12
3 12 18
在本申请实施例中,通过根据序列中第一元素的位置确定用于对第一元素进行自注意力计算的第一窗口的位置,可以使得多个注意力层之间避免重复计算,从而进一步减小自注意力的计算量。
如前文描述,在本申请实施例中,可以使用一个或多个窗口内的元素对第一元素进行自注意力计算。
可选地,在图4所示实施例中,步骤S420包括:使用一个窗口(即第一窗口)内的元素对第一元素进行自注意力计算,获得第一元素的表示。
可选地,在图4所示实施例中,步骤S420包括:对第一元素,使用M个窗口内包含的元素进行自注意力计算,获得第一元素的表示,其中,M个窗口中每个窗口包含输入序 列中的一个元素或多个连续的元素,不同窗口之间至少间隔一个元素,M个窗口中包括所述第一窗口,M大于1,且M的取值是预设的。
作为示例,用于对第一元素进行自注意力计算的M个窗口如图8、图9、图10、图11与图12所示。例如,在图8中,输入序列由元素1至元素15组成,用于对第一元素进行自注意力计算的M个窗口包括包含元素1、2与3的窗口1与包含元素11、12、13的窗口2,窗口1与窗口2之间间隔7个元素。
在本申请实施例中,通过使用大于1个的窗口对序列中的元素进行自注意力计算,这可以保证自注意力的依赖范围。可以理解到,针对一个元素,进行自注意力计算的窗口越多,该元素的自注意力依赖范围越大。本申请实施例可以通过合理设置窗口的数量,来保证自注意力的依赖范围。
M的取值是预设的,表示,M的取值与输入序列的长度无关。也就是说,M的取值可以不随输入序列长度的增大而增大。
作为一个示例,假设输入序列的长度为L1,M的取值被设置为Q;假设输入序列的长度为L2(L2>L1),M的取值依然被设置为Q;假设输入序列的长度为L3(L3<L1),M的取值依然被设置为Q。例如,Q等于2或3或其它大于1的整数。
在本申请实施例中,对一个元素进行自注意力计算的窗口的个数M与输入序列的长度无关,因此,可以避免现有技术中存在的计算开销随输入序列的长度呈平方增长的问题,因此,相对于现有技术可以减小自注意力的计算量。
此外,在本申请实施例中,用于对一个元素进行自注意力计算的M个窗口中不同窗口之间间隔一个或多个元素,这也可以减小自注意的计算量。
此外,M的取值是预设的,也就是说,本申请实施例可以对自注意力的计算量具有一定程度的控制,从而可以通过M的取值的设置来减小自注意力的计算量。
可以根据应用需求确定M的取值。例如,可以根据当前计算能力合理设置M的取值。在计算能力较强的情况下,可以为M设置较大的取值;在计算能力较弱的情况下,可以为M设置较小的取值。
还应理解,在一定程度上,M的取值越大,自注意力的依赖范围也越大。因此,本申请实施例可以在自注意力的计算量不超过计算能力的前提下,尽量扩大自注意力的依赖范围。
因此,本申请实施例在对序列中的元素进行自注意力计算时,通过使用多个窗口内的元素进行计算,该多个窗口的个数与序列的长度无关,且不同窗口之间具有间隔,可以在减小自注意力的计算量的同时,尽量兼顾自注意力的依赖范围,从而可以实现自注意力的计算量与依赖范围的平衡。
使用M个窗口内包含的元素对第一元素进行自注意力计算,获得第一元素的表示,表示,通过建立第一元素与M个窗口内的每个元素之间的关系(即第一元素的元素关系的建模),获得第一元素的表示。
作为一个示例，可以使用如下公式对第一元素h_i进行自注意力计算，获得第一元素的新的表示h′_i：
h′_i = Attend(h_i, S)
其中，S表示M个窗口内包含的元素，Attend()表示自注意力的计算方式。自注意力的计算方式为现有技术，本文不作详述。
类似于第一窗口的位置的确定方式,M个窗口的位置也可以通过多种方式确定。例如,M个窗口的位置是根据第一元素的位置确定的,或者,M个窗口的位置是预设的,与第一元素的位置无关。
作为一个示例,如图8所示,输入序列由元素1至元素15组成,第一元素为元素7,假设设置对元素7进行自注意力计算的窗口内的元素与元素7的依赖长度大于3,且小于7,则对元素7进行自注意力计算的M个窗口包括窗口1与窗口2。
作为另一个示例,如图10所示,输入序列为文本序列“他在26岁创立著名的狭义相对论”,第一元素为元素“创立”,假设设置对元素“创立”进行自注意力计算的窗口内的元素与元素“创立”的依赖长度大于2,且小于5,则对元素“创立”进行自注意力计算的M个窗口包括窗口1与窗口2。
在M个窗口的位置根据第一元素的位置而确定的实施例中,基于第一元素在输入序列中的不同位置,其对应的M个窗口的确定方式可以不同。
方式1),在第一元素为位于输入序列的中间位置的元素的情况下,用于对第一元素进行自注意力计算的M个窗口均位于第一元素的后面。
可选地,在M个窗口的位置根据第一元素的位置而确定的实施例中,在第一元素为输入序列中的中间元素的情况下,M个窗口中包括第三窗口,第三窗口包含输入序列中位于第一元素后面的与第一元素的依赖长度大于或等于ar,且小于br的元素,br为正整数,ar为小于br的非负整数。
作为一个示例，可以使用如下公式对第一元素h_i进行自注意力计算，获得第一元素的新的表示h′_i：
h′_i = Attend(h_i, S)
其中，S={h_j | i+ar ≤ j ≤ i+br}，Attend()表示自注意力的计算方式。
作为一个示例,如图11所示,输入序列由元素1至元素15组成,第一元素为输入序列中的中间元素:元素7,用于对元素7进行自注意力计算的M个窗口包括位于元素7后面的窗口1与窗口2。其中,窗口1包含的元素与元素7的依赖长度大于2,且小于5,窗口2包含的元素与元素7的依赖长度大于6,且小于9。
方式2),在第一元素为位于输入序列的中间位置的元素的情况下,用于对第一元素进行自注意力计算的M个窗口均位于第一元素的前面。
可选地,在M个窗口的位置根据第一元素的位置而确定的实施例中,在第一元素为输入序列中的中间元素的情况下,M个窗口中包括第二窗口,第二窗口包含输入序列中位于第一元素前面的与第一元素的依赖长度大于或等于al,且小于bl的元素,bl为正整数,al为小于bl的非负整数。
作为一个示例，可以使用如下公式对第一元素h_i进行自注意力计算，获得第一元素的新的表示h′_i：
h′_i = Attend(h_i, S)
其中，S={h_j | i-bl ≤ j ≤ i-al}，Attend()表示自注意力的计算方式。
作为一个示例,如图12所示,输入序列由元素1至元素15组成,第一元素为输入序列中的中间元素:元素7,用于对元素7进行自注意力计算的M个窗口包括位于元素7 前面的窗口1与窗口2。其中,窗口1包含的元素与元素7的依赖长度大于4,且小于7,窗口2包含的元素与元素7的依赖长度大于1,且小于4。
方式3),在第一元素为位于输入序列的中间位置的元素的情况下,用于对第一元素进行自注意力计算的M个窗口可以包括位于第一元素的前面的窗口以及位于第一元素的后面的窗口。
可选地,在M个窗口的位置根据第一元素的位置而确定的实施例中,在第一元素为输入序列中的中间元素的情况下,M个窗口中包括第二窗口与第三窗口。第二窗口包含输入序列中位于第一元素前面的与第一元素的依赖长度大于或等于al,且小于bl的元素,bl为正整数,al为小于bl的非负整数。第三窗口包含输入序列中位于第一元素后面的与第一元素的依赖长度大于或等于ar,且小于br的元素,br为正整数,ar为小于br的非负整数。
作为一个示例，可以使用如下公式对第一元素h_i进行自注意力计算，获得第一元素的新的表示h′_i：
h′_i = Attend(h_i, S)
其中，S={h_j | i-bl ≤ j ≤ i-al 或 i+ar ≤ j ≤ i+br}，Attend()表示自注意力的计算方式。
在本例中,al与ar可以相等或不相等,bl与br可以相等或不相等。
作为一个示例,如图8所示,输入序列由元素1至元素15组成,第一元素为输入序列中的中间元素:元素7,用于对元素7进行自注意力计算的M个窗口包括位于元素7前面的窗口1与位于元素7后面的窗口2,窗口1内的元素与元素7的依赖长度大于3,且小于7,窗口2内的元素与元素7的依赖长度也是大于3,且小于7。
方式4),在第一元素为输入序列中首个元素的情况下,用于对第一元素进行自注意力计算的M个窗口为位于第一元素的后面的多个窗口。
方式5),在第一元素为输入序列中最后一个元素的情况下,用于对第一元素进行自注意力计算的M个窗口为位于第一元素的前面的多个窗口。
应理解,上述方式1)、方式2)与方式3)中任一种方式可以与方式4)和方式5)组合。
可选地,在一些实施例中,M个窗口中还可以包括第四窗口,第四窗口包含第一元素及其相邻元素。
作为一个示例,如图9所示,输入序列由元素1至元素15组成,第一元素为输入序列中的中间元素:元素7,用于对元素7进行自注意力计算的M个窗口不仅包括不包含元素7及其相邻元素的窗口1与窗口2,还包括窗口3,窗口3中包含元素7及其相邻元素:元素6与元素8。
在本申请实施例中,针对序列中的一个元素,通过根据该元素的位置确定用于对该元素进行自注意力计算的多个窗口的位置,从而可以灵活地实现自注意力的依赖范围。
可选地,在一些实施例中,M个窗口的位置也可以是预设的。例如,可以与第一元素的位置无关。作为一个示例,以输入序列如图8所示为例,对于输入序列中每个元素,用于对其进行自注意力计算的M个窗口均为图8所示的窗口1与窗口2。
上文实施例中以图8至图12为例描述了用于对序列中的第一元素进行自注意力计算的M个窗口,需要说明的是,图8至图12仅为示例而非限定。在实际应用中,可以根据 应用需求设置M的取值,以尽可能地减小自注意力的计算量,也可以根据应用需求设置M个窗口中每个窗口的边界以及M个窗口中不同窗口之间的间隔,以实现合理的自注意力依赖范围。
在本申请实施例中,在对序列中的元素进行自注意力计算时,通过使用多个窗口内的元素进行计算,该多个窗口的个数与序列的长度无关,且不同窗口之间具有间隔,可以在减小自注意力的计算量的同时,尽量兼顾自注意力的依赖范围,从而可以实现自注意力的计算量与依赖范围的平衡。此外,通过根据待计算元素的位置确定用于对该元素进行自注意力计算的多个窗口的位置,可以灵活地实现自注意力的依赖范围。
在多个自主注意力层的场景中,通过采用本申请实施例提供的方法,可以让高层的自注意力层跳过前面层已经建模过的部分元素,可以减小计算量。
图13与图14示出在三个自注意力层场景下,使用本申请实施例提供的自注意力机制与图3所示的局部自注意力对同一个文本序列进行自注意力计算的情形。在图13与图14中,文本序列为“他在26岁创立著名的狭义相对论”,自注意力层(X-2)为自注意力层(X-1)的前一级层,自注意力层(X-1)为自注意力层X的前一级层。
图13为使用本申请实施例提供的自注意力机制对文本序列进行自注意力计算的示意图。以对元素“创立”进行自注意力计算为例,在自注意力层(X-2)上,使用元素“岁”、“创立”与“著名”进行计算;在自注意力层(X-1)上,使用元素“在”、“26”、“的”与“狭义”进行计算;在自注意力层X上,使用元素“他”与“相对论”进行计算。可知,3个自注意力层分别在对元素“创立”进行自注意力计算时,自注意力层(X-1)跳过了自注意力层(X-2)已使用的元素(元素“岁”、“创立”与“著名”),自注意力层X跳过了自注意力层(X-1)已使用的元素(元素“在”、“26”、“的”与“狭义”),这可以减小计算量。
继续参见图13,通过自注意力层(X-2)对元素“相对论”的自注意力计算、自注意力层(X-1)对元素“创立”的自注意力计算、自注意力层X对元素“他”的自注意力计算,实现了序列中距离最远的两个元素“他”与“相对论”的关系建立。换句话说,通过3个自注意力层的处理,实现了长度为8的依赖。
图14为使用图3所示的局部自注意力机制对文本序列进行自注意力计算的示意图。以对元素“创立”进行自注意力计算为例,在自注意力层(X-2)上,使用元素“26”、“岁”、“创立”、“著名”与“的”进行计算;在自注意力层(X-1)与自注意力层X上,依然使用元素“26”、“岁”、“创立”、“著名”与“的”进行计算,这导致了多个自注意力层之间的重复计算。
继续参见图14,通过自注意力层(X-2)对元素“创立”的自注意力计算、自注意力层(X-1)对元素“26”的自注意力计算、自注意力层X对元素“他”的自注意力计算,仅实现了序列中元素“他”与“的”的关系建立。换句话说,通过3个自注意力层的处理,实现了长度为6的依赖。
对比图13与图14可知,在经过相同数量的自注意力层的处理的情况下,本申请实施例提供的自注意力机制比现有局部自注意力机制可以建模更远距离的依赖。
本申请实施例提供的序列处理的方法可以应用于语音处理***。例如,该语音处理***为语音识别***。例如,上述实施例提供的方法400中的输入序列为语音序列。
本申请实施例提供的序列处理的方法还可以应用于自然语言处理***。例如，自然语言处理***为下列***中的任一种***：翻译***、基于BERT模型的自然语言理解模块（natural language understanding，NLU）***。例如，上述实施例提供的方法400中的输入序列为文本序列。
本文中描述的各个实施例可以为独立的方案,也可以根据内在逻辑进行组合,这些方案都落入本申请的保护范围中。
上文描述了本申请提供的方法实施例,下文将描述本申请提供的装置实施例。应理解,装置实施例的描述与方法实施例的描述相互对应,因此,未详细描述的内容可以参见上文方法实施例,为了简洁,这里不再赘述。
图15为本申请实施例提供的序列处理的装置1500的示意性框图。装置1500包括输入单元1510、处理单元1520与输出单元1530。
输入单元1510,用于接收输入序列,并将输入序列输入处理单元1520,输入序列包括多个具有先后顺序的元素。
处理单元1520,用于对输入序列中的第一元素,使用M个窗口内包含的元素进行自注意力计算,获得第一元素的表示,其中,M个窗口中每个窗口内包含输入序列中的一个元素或连续的多个元素,且不同窗口之间至少间隔一个元素,M个窗口中至少一个窗口内不包含第一元素,M为大于或等于1的整数。
输出单元1530,用于基于第一元素的表示,获得输入序列对应的输出序列。
可选地,在一些实施例中,处理单元1520还用于,处理单元还用于,根据第一元素在输入序列中的位置,确定M个窗口,M个窗口中包括第一窗口,第一窗口包含输入序列中与第一元素的依赖长度大于或等于a,且小于b的元素,其中,a为大于1的整数,b为大于a的整数,该依赖长度表示第一元素与所述M个窗口内的元素之间的距离。
可选地,在一些实施例中,装置1500应用于多个自注意力层,输入序列是当前自注意力层的前一级自注意力层输出的序列;处理单元1520还用于,根据第一元素在输入序列中的位置,确定第一窗口,第一窗口包含输入序列中与第一元素的依赖长度大于或等于a,且小于b的元素,b与a的取值被设置为,使得当前自注意力层对第一元素的自注意力计算与前一级自注意力层对第一元素的自注意力计算没有重复计算。
可选地,在一些实施例中,前一级自注意力层基于第五窗口内包含的元素对第一元素进行自注意力计算,第五窗口包含序列中与第一元素的依赖长度大于或等于a1,且小于b1的元素,b1为正整数,a1为小于b1的非负整数;处理单元1520还用于,根据第一元素在输入序列中的位置,确定第一窗口,第一窗口包含输入序列中与第一元素的依赖长度大于或等于a,且小于b的元素,其中,a的取值大于b1的取值。
可选地,在一些实施例中,M等于1,处理单元1520,用于对输入序列中的第一元素,使用第一窗口内包含的元素进行自注意力计算,获得第一元素的表示,其中,第一窗口内包含输入序列中的一个元素或连续的多个元素,但不包含第一元素。
可选地,在一些实施例中,M大于1,且M的取值是预设的。
可选地,在一些实施例中,M个窗口中包括第二窗口和/或第三窗口。
第二窗口,第二窗口包含输入序列中位于第一元素前面的与第一元素的依赖长度大于或等于al,且小于bl的元素,bl为正整数,al为小于bl的非负整数。
第三窗口,第三窗口包含输入序列中位于第一元素后面的与第一元素的依赖长度大于 或等于ar,且小于br的元素,br为正整数,ar为小于br的非负整数。
可选地,在一些实施例中,M个窗口中包括第四窗口,第四窗口包含第一元素及其相邻元素。
可选地,在一些实施例中,输入序列为语音序列或文本序列。
本申请实施例提供的序列处理的装置1500也可以称为序列处理装置。可选地,该序列处理装置中还可以包括其他神经网络层的处理模块。
如图16所示,本申请实施例还提供一种序列处理的装置1600。该装置1600包括处理器1610,处理器1610与存储器1620耦合,存储器1620用于存储计算机程序或指令,处理器1610用于执行存储器1620存储的计算机程序或指令,使得上文方法实施例中的方法被执行。
可选地,如图16所示,该装置1600还可以包括存储器1620。
可选地,如图16所示,该装置1600还可以包括数据接口1630,数据接口1630用于与外界进行数据的传输。
如图17所示,本申请实施例还提供一种包括神经网络处理装置1700,包括输入模块1710、处理模块1720、输出模块1730以及本申请实施例提供的序列处理的装置1500。
输入模块1710,用于将待处理的输入序列传递到序列处理的装置1500。
可选地,输入模块1710还可以包括特征提取单元,用于从待处理数据中提取特征数据,该特征数据作为序列处理的装置1500的输入。
序列处理的装置1500用于对输入序列进行自注意力计算,获得该输入序列对应的输出序列。
处理模块1720,用于对装置1500获得的输出序列进行处理,获得序列处理结果。
输出模块1730,用于基于处理模块1720获得的序列处理结果输出输出信号。
可选地,在一些实施例中,输入模块1710,用于将语音序列传递到序列处理的装置1500;处理模块1720用于对序列处理的装置1500获得的输出序列进行语音识别处理,获得语音识别结果。
在本实施例中,神经网络处理装置1700可以称为语音处理***。
可选地,在一些实施例中,输入模块1710,用于将待文本序列传递到序列处理的装置1500;处理模块1720用于对序列处理的装置1500获得的输出序列进行语义理解处理,获得语义理解结果。
在本实施例中,神经网络处理装置1700可以称为自然语言处理***。
通过组合其他类型的神经网络层与应用本申请实施例提供的自注意力机制的自注意力层,可以构建高效的序列数据处理***。
图18为本申请实施例可以应用的语音识别***1800的示意性框图。语音识别***1800可用于进行实时的语音识别。语音识别***1800包括输入模块1810、识别器模块1820与输出模块1830。识别器模块1820为包括自注意力层的神经网络,其中,识别器模块1820包括的至少一个自注意力层采用本申请实施例提供的自注意力机制,即采用上文实施例提供的方法400处理输入序列。
输入模块1810用于接收待处理数据,并基于待处理数据获得识别器模块1820的输入,即输入序列。
例如,输入模块1810中可以包括声学特征提取单元。声学特征提取单元用于对输入的待处理数据进行特征提取,获得特征数据。声学特征提取单元提取的特征数据是识别器模块1820的输入。
识别器模块1820用于对输入模块1810输入的序列进行语音识别的处理,获得语音识别结果。识别器模块1820包括自注意力模块1821与其他神经网络模块1822。
例如,自注意力模块1821包括如下结构:批标准化(batch normalization)层、自注意力层、残差连接(residual)、FFN层。自注意力模块1821中包括的至少一个自注意力层采用了本申请实施例提供的自注意力机制,即采用上文实施例提供的方法400处理输入的序列。
残差连接是一种神经网络连接方式,一般指的是,把当前层的输出与前面某一个层的输出相加作为输出。批标准化(batch normalization)是一种对神经网络的中间值做归一化的方法。FFN层例如为Position-wise FFN,Position-wise FFN是指在对序列中每个位置都使用同一个FFN,该FFN有两层,第一层的激活函数是ReLU,第二层没有激活函数。其中,ReLU是一种神经网络的激活函数。例如,ReLU的计算方法为y=max(x,0),其中x表示输入,y表示输出。
例如,自注意力模块1821可以堆叠N次。
其他神经网络模块1822可以包括卷积模块(Convolution block)。例如,卷积模块可重复堆叠M次。
例如,其他神经网络模块1822可以为ConvBlock。ConvBlock指的是,卷积(Convolution)层接批标准化(batch normalization)层再接ReLU的结构。
例如,识别器模块1820也可以堆叠K次。
卷积(Convolution)层、批标准化(batch normalization)层、FFN、ReLU都是常见的神经网络结构组件,本申请对此不作详述。
输出模块1830用于基于识别器模块1820获得的语音识别结果输出输出信号。例如,输出信号为字符序列。
可选地，输出模块1830包括如下结构：层标准化（layer normalization，layer norm）与输出前馈神经网络（output FFN）。
前馈神经网络(FFN)是一种神经网络。例如,单层FFN的计算过程可以表示为y=act(Wx+b),其中,x表示输入特征数据,y表示输出特征数据,W与b表示参数,act()表示激活函数。
应理解,本申请实施例提供的语音识别***1800,因为应用了本申请实施例提供的自注意力机制,因此可以减小自注意力的计算量,同时可以保证自注意力的依赖范围,从而可以实现序列数据的高效处理。
本申请实施例还提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行上述实施例的方法。
本申请实施例还提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述实施例的方法。
本申请实施例还提供一种芯片,该芯片包括处理器与数据接口,处理器通过数据接口读取存储器上存储的指令,执行上述实施例的方法。
可选地,作为一种实现方式,该芯片还可以包括存储器,存储器中存储有指令,处理器用于执行存储器上存储的指令,当指令被执行时,处理器用于执行上述实施例中的方法。
图19为本申请实施例提供的一种芯片硬件结构,该芯片上包括神经网络处理器1900。该芯片可以被设置在如下任一种或多种装置中:
如图15所示的装置1500、如图16所示的装置1600、如图17中所示的装置1700、如图18所示的装置1800。
上文方法实施例中的方法400可在如图19所示的芯片中得以实现。
神经网络处理器1900作为协处理器挂载到主处理器(Host CPU)上,由主CPU分配任务。神经网络处理器1900的核心部分为运算电路1903,控制器1904控制运算电路1903获取存储器(权重存储器1902或输入存储器1901)中的数据并进行运算。
在一些实现中,运算电路1903内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路1903是二维脉动阵列。运算电路1903还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1903是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路1903从权重存储器1902中取矩阵B相应的数据,并缓存在运算电路1903中每一个PE上。运算电路1903从输入存储器1901中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1908中。
向量计算单元1907可以对运算电路1903的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元1907可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中，向量计算单元1907将经处理的输出的向量存储到统一存储器（也可称为统一缓存器）1906。例如，向量计算单元1907可以将非线性函数应用到运算电路1903的输出，例如累加值的向量，用以生成激活值。在一些实现中，向量计算单元1907生成归一化的值、合并值，或二者均有。在一些实现中，处理过的输出的向量能够用作运算电路1903的激活输入，例如用于在神经网络中的后续层中的使用。
上文方法实施例中的方法400可以由1903或1907执行。
统一存储器1906用于存放输入数据以及输出数据。
可以通过存储单元访问控制器1905(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器1901和/或统一存储器1906、将外部存储器中的权重数据存入权重存储器1902,以及将统一存储器1906中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)1910,用于通过总线实现主CPU、DMAC和取指存储器1909之间进行交互。
与控制器1904连接的取指存储器(instruction fetch buffer)1909,用于存储控制器1904使用的指令;
控制器1904，用于调用取指存储器1909中缓存的指令，实现控制该运算加速器的工作过程。
一般地,统一存储器1906,输入存储器1901,权重存储器1902以及取指存储器1909 均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中在本申请的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本申请。
需要说明的是,本文中涉及的第一或第二等各种数字编号仅为描述方便进行的区分,并不用来限制本申请实施例的范围。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的***、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的***、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:通用串行总线闪存盘(USB flash disk,UFD)(UFD也可以简称为U盘或者优盘)、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (20)

  1. 一种序列处理的方法,其特征在于,包括:
    接收输入序列,所述输入序列包括多个具有先后顺序的元素;
    对所述输入序列中的第一元素,使用M个窗口内包含的元素进行自注意力计算,获得所述第一元素的表示,其中,所述M个窗口中每个窗口内包含所述输入序列中的一个元素或连续的多个元素,且不同窗口之间至少间隔一个元素,所述M个窗口中至少一个窗口内不包含所述第一元素,M为大于或等于1的整数;
    基于所述第一元素的表示,获得所述输入序列对应的输出序列。
  2. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    根据所述第一元素在所述输入序列中的位置,确定所述M个窗口,所述M个窗口中包括第一窗口,所述第一窗口包含所述输入序列中与所述第一元素的依赖长度大于或等于a,且小于b的元素,其中,a为大于1的整数,b为大于a的整数,所述依赖长度表示所述第一元素与所述M个窗口内的元素之间的距离。
  3. 根据权利要求2所述的方法,其特征在于,所述方法应用于多个自注意力层,所述输入序列是当前自注意力层的前一级自注意力层输出的序列;
    其中,b与a的取值被设置为,使得所述当前自注意力层对所述第一元素的自注意力计算与所述前一级自注意力层对所述第一元素的自注意力计算没有重复计算。
  4. 根据权利要求3所述的方法,其特征在于,所述前一级自注意力层基于第五窗口内包含的元素对所述第一元素进行自注意力计算,所述第五窗口包含所述序列中与所述第一元素的依赖长度大于或等于a1,且小于b1的元素,b1为正整数,a1为小于b1的非负整数;
    其中,a的取值大于b1的取值。
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,所述M大于1,且M的取值是预设的。
  6. 根据权利要求5所述的方法,其特征在于,所述M个窗口中包括:
    第二窗口,所述第二窗口包含所述输入序列中位于所述第一元素前面的元素;和/或
    第三窗口，所述第三窗口包含所述输入序列中位于所述第一元素后面的元素。
  7. 根据权利要求5或6所述的方法,其特征在于,所述M个窗口中包括第四窗口,所述第四窗口包含所述第一元素及其相邻元素。
  8. 根据权利要求1-7中任一项所述的方法,其特征在于,所述输入序列为语音序列或文本序列。
  9. 一种序列处理的装置,其特征在于,包括:
    接收单元,用于接收输入序列,所述输入序列包括多个具有先后顺序的元素;
    处理单元,用于对所述输入序列中的第一元素,使用M个窗口内包含的元素进行自注意力计算,获得所述第一元素的表示,其中,所述M个窗口中每个窗口内包含所述输入序列中的一个元素或连续的多个元素,且不同窗口之间至少间隔一个元素,所述M个窗口中至少一个窗口内不包含所述第一元素,M为大于或等于1的整数;
    输出单元,用于基于所述第一元素的表示,获得所述输入序列对应的输出序列。
  10. 根据权利要求9所述的装置,其特征在于,所述处理单元还用于,根据所述第一元素在所述输入序列中的位置,确定所述M个窗口,所述M个窗口中包括第一窗口,所述第一窗口包含所述输入序列中与所述第一元素的依赖长度大于或等于a,且小于b的元素,其中,a为大于1的整数,b为大于a的整数,所述依赖长度表示所述第一元素与所述M个窗口内的元素之间的距离。
  11. 根据权利要求10所述的装置,其特征在于,所述装置应用于多个自注意力层,所述输入序列是当前自注意力层的前一级自注意力层输出的序列;
    其中,b与a的取值被设置为,使得所述当前自注意力层对所述第一元素的自注意力计算与所述前一级自注意力层对所述第一元素的自注意力计算没有重复计算。
  12. 根据权利要求11所述的装置,其特征在于,所述前一级自注意力层基于第五窗口内包含的元素对所述第一元素进行自注意力计算,所述第五窗口包含所述序列中与所述第一元素的依赖长度大于或等于a1,且小于b1的元素,b1为正整数,a1为小于b1的非负整数;
    其中,a的取值大于b1的取值。
  13. 根据权利要求9-12中任一项所述的装置,其特征在于,M大于1,且M的取值是预设的。
  14. 根据权利要求13所述的装置,其特征在于,所述M个窗口中包括:
    第二窗口,所述第二窗口包含所述输入序列中位于所述第一元素前面的元素;和/或
    第三窗口,所述第三窗口包含所述输入序列中位于所述第一元素后面的元素。
  15. 根据权利要求13或14所述的装置,其特征在于,所述M个窗口中包括第四窗口,所述第四窗口包含所述第一元素及其相邻元素。
  16. 根据权利要求9-15中任一项所述的装置,其特征在于,所述输入序列为语音序列或文本序列。
  17. 一种神经网络处理装置,其特征在于,包括输入模块、处理模块、输出模块以及如权利要求9-16中任一项所述的序列处理的装置;
    所述输入模块用于,将输入序列输入所述序列处理的装置;
    所述序列处理的装置用于,对所述输入序列进行自注意力计算,获得所述输入序列对应的输出序列;
    所述处理模块用于,对所述输出序列进行处理,获得序列处理结果;
    所述输出模块,用于基于所述处理模块获得的序列处理结果输出输出信号;
    其中,在所述输入序列为语音序列的情况下,所述处理模块用于对所述输出序列进行语音识别处理,获得语音识别结果;或
    在所述输入序列为文本序列的情况下,所述处理模块用于对所述输出序列进行语义理解处理,获得语义理解结果。
  18. 一种数据处理的装置,其特征在于,包括:
    存储器,用于存储可执行指令;
    处理器,用于调用并运行所述存储器中的所述可执行指令,以执行权利要求1至8中任一项所述的方法。
  19. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有程序指令,当所述程序指令由处理器运行时,实现权利要求1至8中任一项所述的方法。
  20. 一种计算机程序产品,其特征在于,所述计算机程序产品包括计算机程序代码,当所述计算机程序代码在计算机上运行时,实现权利要求1至8中任一项所述的方法。
PCT/CN2021/073868 2020-05-26 2021-01-27 序列处理的方法与装置 WO2021238289A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21813598.6A EP4152203A4 (en) 2020-05-26 2021-01-27 SEQUENCE PROCESSING METHOD AND APPARATUS
US17/994,068 US20230088915A1 (en) 2020-05-26 2022-11-25 Method and apparatus for sequence processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010454695.6 2020-05-26
CN202010454695.6A CN111783446B (zh) 2020-05-26 2020-05-26 序列处理的方法与装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/994,068 Continuation US20230088915A1 (en) 2020-05-26 2022-11-25 Method and apparatus for sequence processing

Publications (1)

Publication Number Publication Date
WO2021238289A1 true WO2021238289A1 (zh) 2021-12-02

Family

ID=72753447

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073868 WO2021238289A1 (zh) 2020-05-26 2021-01-27 序列处理的方法与装置

Country Status (4)

Country Link
US (1) US20230088915A1 (zh)
EP (1) EP4152203A4 (zh)
CN (1) CN111783446B (zh)
WO (1) WO2021238289A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783446B (zh) * 2020-05-26 2022-07-19 华为技术有限公司 序列处理的方法与装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919188A (zh) * 2019-01-29 2019-06-21 华南理工大学 基于稀疏局部注意力机制和卷积回声状态网络的时序分类方法
CN110096711A (zh) * 2019-05-09 2019-08-06 中国科学技术大学 序列全局关注和局部动态关注的自然语言语义匹配方法
CN110162625A (zh) * 2019-04-19 2019-08-23 杭州电子科技大学 基于句内词对关系和上下文用户特征的反讽检测方法
WO2019240900A1 (en) * 2018-06-12 2019-12-19 Siemens Aktiengesellschaft Attention loss based deep neural network training
CN111783446A (zh) * 2020-05-26 2020-10-16 华为技术有限公司 序列处理的方法与装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224287B (zh) * 2014-06-24 2017-12-29 联想(北京)有限公司 数据处理方法、装置及电子设备
CN110163339A (zh) * 2019-03-06 2019-08-23 腾讯科技(深圳)有限公司 神经网络中网络表示生成、编码方法和装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019240900A1 (en) * 2018-06-12 2019-12-19 Siemens Aktiengesellschaft Attention loss based deep neural network training
CN109919188A (zh) * 2019-01-29 2019-06-21 华南理工大学 基于稀疏局部注意力机制和卷积回声状态网络的时序分类方法
CN110162625A (zh) * 2019-04-19 2019-08-23 杭州电子科技大学 基于句内词对关系和上下文用户特征的反讽检测方法
CN110096711A (zh) * 2019-05-09 2019-08-06 中国科学技术大学 序列全局关注和局部动态关注的自然语言语义匹配方法
CN111783446A (zh) * 2020-05-26 2020-10-16 华为技术有限公司 序列处理的方法与装置

Also Published As

Publication number Publication date
CN111783446A (zh) 2020-10-16
EP4152203A4 (en) 2023-10-25
EP4152203A1 (en) 2023-03-22
CN111783446B (zh) 2022-07-19
US20230088915A1 (en) 2023-03-23

Similar Documents

Publication Publication Date Title
WO2021047286A1 (zh) 文本处理模型的训练方法、文本处理方法及装置
CN110546654B (zh) 通过构造接口的带宽控制来增强dnn模块的处理性能
US9916531B1 (en) Accumulator constrained quantization of convolutional neural networks
WO2020233130A1 (zh) 一种深度神经网络压缩方法及相关设备
WO2017219991A1 (zh) 适用于模式识别的模型的优化方法、装置及终端设备
WO2020073211A1 (zh) 运算加速器、处理方法及相关设备
WO2022068627A1 (zh) 一种数据处理方法及相关设备
JP2019036298A (ja) 知能型高帯域幅メモリシステム及びそのための論理ダイ
JP7331975B2 (ja) クロスモーダル検索モデルのトレーニング方法、装置、機器、および記憶媒体
TWI796286B (zh) 一種機器學習系統的訓練方法和訓練系統
US20220083868A1 (en) Neural network training method and apparatus, and electronic device
CN112219210B (zh) 信号处理装置和信号处理方法
WO2023231794A1 (zh) 一种神经网络参数量化方法和装置
CN109472344A (zh) 类神经网络***的设计方法
WO2023245965A1 (zh) 一种脉冲神经网络加速计算***、方法、设备及非易失性可读存储介质
US20220292337A1 (en) Neural network processing unit, neural network processing method and device
WO2018113790A1 (zh) 一种人工神经网络运算的装置及方法
WO2022233195A1 (zh) 神经网络权值存储方法、读取方法及相关设备
WO2021147276A1 (zh) 数据处理方法、装置及芯片、电子设备、存储介质
WO2021238289A1 (zh) 序列处理的方法与装置
CN113012689B (zh) 一种电子设备和深度学习硬件加速方法
CN117574970A (zh) 用于大规模语言模型的推理加速方法、***、终端及介质
WO2023231887A1 (zh) 基于张量的持续学习方法和装置
WO2021081854A1 (zh) 一种卷积运算电路和卷积运算方法
US20230143985A1 (en) Data feature extraction method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21813598

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021813598

Country of ref document: EP

Effective date: 20221212

NENP Non-entry into the national phase

Ref country code: DE