WO2020048292A1 - Method, apparatus, storage medium, and device for generating a network representation of a neural network - Google Patents

Method, apparatus, storage medium, and device for generating a network representation of a neural network

Info

Publication number
WO2020048292A1
WO2020048292A1 (PCT application PCT/CN2019/100212; Chinese application CN2019100212W)
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
vector
local
representation
local enhancement
Prior art date
Application number
PCT/CN2019/100212
Other languages
English (en)
French (fr)
Inventor
涂兆鹏
杨宝嵩
张潼
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2020551812A (JP7098190B2)
Priority to EP19857335.4A (EP3848856A4)
Publication of WO2020048292A1
Priority to US17/069,609 (US11875220B2)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/47Machine-assisted translation, e.g. using translation memory

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method, a device, a storage medium, and a device for generating a network representation of a neural network.
  • The attention mechanism is a method for modeling the dependency relationship between the hidden states of the encoder and decoder in a neural network. It is widely used in deep learning-based natural language processing (NLP, Natural Language Processing).
  • NLP: Natural Language Processing
  • SAN: Self-Attention Network
  • SAN is a neural network model based on the self-attention mechanism and belongs to the family of attention models. It calculates an attention weight for each element pair in the input sequence, so long-distance dependencies can be captured and the network representation corresponding to each element is not affected by the distance between elements. However, SAN fully considers every element in the input sequence, so the attention weight between each element and all other elements must be calculated, which disperses the weight distribution to a certain extent and thereby weakens the connections between elements.
  • a method for generating a network representation of a neural network for use in a computer device.
  • the method includes: obtaining a source-side vector representation sequence corresponding to an input sequence; linearly transforming the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence; calculating a logical similarity between the request vector sequence and the key vector sequence; constructing a local enhancement matrix according to the request vector sequence; performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element; and fusing the value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence.
  • an apparatus for generating a network representation of a neural network includes:
  • An acquisition module for acquiring a source-side vector representation sequence corresponding to the input sequence
  • a linear transformation module configured to linearly transform the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence;
  • a logical similarity calculation module configured to calculate a logical similarity between the request vector sequence and the key vector sequence
  • a local enhancement matrix construction module configured to construct a local enhancement matrix according to the request vector sequence
  • An attention weight distribution determination module configured to perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each of the elements;
  • a fusion module is configured to fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence.
  • A computer-readable storage medium is provided, which stores a computer program; when the computer program is executed by a processor, the processor is caused to execute the steps of the method for generating a network representation of a neural network.
  • A computer device including a memory and a processor is provided. The memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the steps of the method for generating a network representation of a neural network.
  • The above method, apparatus, storage medium, and device for generating a network representation of a neural network construct a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so that attention weights can be allocated within a local enhancement range and local information is strengthened. By linearly transforming the source-side vector representation sequence, a request vector sequence, a key vector sequence, and a value vector sequence are obtained; a logical similarity is calculated from the request vector sequence and the key vector sequence; a non-linear transformation based on the logical similarity and the local enhancement matrix then yields the locally enhanced attention weight distribution, which modifies the original attention weights; and the weighted summation of the value vector sequence according to the locally enhanced attention weight distribution produces a network representation sequence with enhanced local information, which not only strengthens local information but also preserves the connections between long-distance elements in the input sequence.
  • FIG. 1 is an application environment diagram of a method for generating a network representation of a neural network in an embodiment
  • FIG. 2 is a schematic flowchart of a method for generating a network representation of a neural network according to an embodiment
  • FIG. 3 is a schematic diagram of a process of calculating a network representation sequence corresponding to an input sequence in an embodiment
  • FIG. 4 is a system architecture diagram using a Gaussian distribution to modify SAN attention weight distribution in an embodiment
  • FIG. 5 is a schematic flowchart of constructing a local enhancement matrix according to a request vector sequence in an embodiment
  • FIG. 6 is a schematic flowchart of determining a local enhancement range according to a request vector sequence in an embodiment
  • FIG. 7 is a schematic flowchart of determining a local enhancement range according to a request vector sequence and a key vector sequence in an embodiment
  • FIG. 8 is a schematic structural diagram of a multi-layer stacked multi-head self-attention neural network according to an embodiment
  • FIG. 9 is a schematic flowchart of a method for generating a network representation of a neural network according to an embodiment
  • FIG. 10 is a structural block diagram of a network representation generating device of a neural network in an embodiment
  • FIG. 11 is a structural block diagram of a computer device in one embodiment.
  • FIG. 1 is an application environment diagram of a method for generating a network representation of a neural network in an embodiment.
  • the network representation generation method of the neural network is applied to a network representation generation system of a neural network.
  • the network representation generating system of the neural network includes a terminal 110 and a computer device 120.
  • the terminal 110 and the computer device 120 are connected through Bluetooth, USB (Universal Serial Bus), or a network.
  • The terminal 110 can send an input sequence to be processed to the computer device 120, either in real time or not in real time.
  • the computer device 120 is configured to receive an input sequence and transform the input sequence to output a corresponding network representation sequence.
  • the terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, and a notebook computer.
  • The computer device 120 may be an independent server or terminal, a server cluster composed of multiple servers, or a cloud server providing basic cloud computing services such as cloud hosts, cloud databases, cloud storage, and CDNs.
  • the computer device 120 may directly obtain the input sequence without using the terminal 110.
  • For example, when the computer device is a mobile phone, the mobile phone can directly obtain the input sequence (such as the sequence formed by the words in an instant text message), transform the input sequence using the network representation generating apparatus of the neural network configured on the mobile phone, and output the network representation sequence corresponding to the input sequence.
  • a method for generating a network representation of a neural network is provided. This embodiment is mainly described by using the method applied to the computer device 120 in FIG. 1 described above.
  • the method for generating a network representation of the neural network may include the following steps:
  • The input sequence is a sequence for which a corresponding network representation sequence is to be obtained after transformation.
  • The input sequence X = {x_1, x_2, ..., x_I} is of length I, and I is a positive integer.
  • The input sequence may be a word sequence corresponding to a text to be translated, and each element in the input sequence is a word in the word sequence.
  • The word sequence can be a sequence formed by arranging the words in word order after the text to be translated is segmented; if the text to be translated is an English text, the word sequence is the sequence formed by the words in word order. For example, if the text to be translated is "Bush held a talk with Sharon", the corresponding input sequence X is {Bush, held, a, talk, with, Sharon}.
  • the source-side vector representation sequence is a sequence formed by the corresponding source-side vector representation of each element in the input sequence.
  • Each vector in the source-side vector representation sequence corresponds to an element in the input sequence.
  • The computer device can convert each element in the input sequence into a fixed-length vector (that is, Word Embedding, a word embedding).
  • When the method for generating a network representation of a neural network is applied to a neural network model, the computer device can convert each element in the input sequence into a corresponding vector through the first layer of the neural network model; for example, the i-th element x_i in the input sequence is transformed into a d-dimensional column vector z_i, and the vectors corresponding to all the elements in the input sequence are combined to obtain the source-side vector representation sequence corresponding to the input sequence, that is, a sequence Z = {z_1, z_2, ..., z_I} of I d-dimensional column vectors.
  • The computer device may also receive the source-side vector representation sequence corresponding to the input sequence from other devices.
  • Both z_i and the column vectors mentioned below can equally be row vectors; to simplify the presentation, column vectors are used throughout.
  • S204 Perform linear transformation on the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence.
  • a linear transformation can map a vector that belongs to one vector space to another vector space.
  • a vector space is a collection of multiple vectors of the same dimension.
  • The computer device may perform linear transformations on the source-side vector representation sequence through three different learnable parameter matrices, thereby mapping the source-side vector representation sequence to three different vector spaces to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to the source-side vector representation sequence.
  • When the method for generating a network representation of a neural network is applied to a model based on a SAN (self-attention network), the request vector sequence, the key vector sequence, and the value vector sequence are all obtained by linearly transforming the source-side vector representation sequence corresponding to the input sequence.
  • When the method for generating a network representation of a neural network is applied to a neural network model with an Encoder-Decoder structure, the key vector sequence and the value vector sequence are obtained by the encoder encoding the source-side vector representation sequence corresponding to the input sequence; that is, the key vector sequence and the value vector sequence are the output of the encoder, while the request vector sequence is the input of the decoder, which can be a target-side vector representation sequence obtained by linearly transforming the vector representation of each element in the output sequence produced by the decoder.
  • The computer device can linearly transform the source-side vector representation sequence Z using three different learnable parameter matrices W_Q, W_K, and W_V to obtain the request vector sequence Q, the key vector sequence K, and the value vector sequence V:
    Q = Z · W_Q
    K = Z · W_K
    V = Z · W_V
  • where each z_i in Z = {z_1, z_2, ..., z_I} is a d-dimensional column vector, that is, Z is a vector sequence composed of I d-dimensional column vectors and can be written as an I × d matrix;
  • the learnable parameter matrices W_Q, W_K, and W_V are d × d matrices;
  • and the request vector sequence Q, the key vector sequence K, and the value vector sequence V are I × d matrices.
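As a concrete illustration, the three linear transformations can be written in a few lines of NumPy. This is a minimal sketch rather than the claimed implementation: the sequence length, the dimension, and the random initialization of W_Q, W_K, and W_V are stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
I, d = 6, 512                                # sequence length and model dimension

# Source-side vector representation sequence Z, one d-dimensional vector per element.
Z = rng.normal(size=(I, d))

# Three different learnable d x d parameter matrices (random stand-ins for trained weights).
W_Q = rng.normal(size=(d, d)) / np.sqrt(d)
W_K = rng.normal(size=(d, d)) / np.sqrt(d)
W_V = rng.normal(size=(d, d)) / np.sqrt(d)

Q = Z @ W_Q                                  # request (query) vector sequence, I x d
K = Z @ W_K                                  # key vector sequence, I x d
V = Z @ W_V                                  # value vector sequence, I x d
```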
  • the logical similarity is used to measure the similarity between each element in the input sequence and other elements in the input sequence.
  • Based on the similarity, corresponding attention weights can be assigned to the value vectors of the other elements in the input sequence, so that the network representation output for each element takes into account the connections between that element and the other elements, enabling the generated network representation to express the characteristics of each element more accurately and to cover richer information.
  • When the method for generating a network representation of a neural network is applied to a neural network model with an Encoder-Decoder structure, the request vector sequence is a target-side vector representation sequence, and the calculated logical similarity represents the similarity between the target-side vector representation sequence and the key vector sequence corresponding to the input sequence. Based on this similarity, corresponding attention weights are assigned to the value vector sequence corresponding to the input sequence, so that the network representation of each element output by the source side can take into account the influence of the target-side vector representation sequence input at the target side.
  • The computer device can calculate the logical similarity matrix E between the request vector sequence Q and the key vector sequence K through the cosine similarity formula, that is:
    E = (Q · K^T) / √d
  • where K^T is the transpose of the key vector sequence K, and d is the dimension of the source-side vector z_i into which each element x_i in the input sequence is transformed; d is also the dimension of the network representation corresponding to x_i, that is, the dimension of the network's hidden state vector. Dividing by √d in the above formula scales down the inner product and speeds up the calculation.
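Continuing the sketch above, the logical similarity matrix is one line: each row E[i] holds the similarities between the request vector q_i and every key vector.

```python
E = (Q @ K.T) / np.sqrt(d)                   # logical similarity matrix, I x I
```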
  • Each element of each column vector in the local enhancement matrix represents the strength of the connection between a pair of elements in the input sequence.
  • The local enhancement matrix can be used to strengthen the influence, on the network representation of the current element, of those elements in the input sequence that are more related to the current element, while relatively weakening the influence of elements that are weakly related to the current element.
  • The local enhancement matrix limits the scope of consideration to local elements, rather than all elements in the input sequence, when accounting for the influence of other elements on the network representation of the current element, so that the allocation of attention weights can be biased accordingly.
  • The magnitude of the attention weight assigned to the value vector of an element within the local range is related to the strength of the connection between that element and the current element; that is, the value vectors of elements strongly related to the current element are assigned larger attention weights.
  • With the method for generating a network representation of a neural network in this embodiment, when outputting the network representation corresponding to "Bush", the attention weights are allocated within a local enhancement range.
  • When generating the network representation corresponding to "Bush", if the connection between the element "Bush" and the element "held" is strong, a higher attention weight is assigned to the value vector corresponding to "held". Similarly, "a talk" among the local elements within the local enhancement range corresponding to "Bush" will also be noticed and assigned a higher attention weight.
  • When the computer device generates the network representation corresponding to each element, it needs to determine the local enhancement range corresponding to the current element, so that the allocation of the attention weights corresponding to the current element is limited to the local enhancement range.
  • The local enhancement range can be determined from two variables: the center point of the local enhancement range and the window size of the local enhancement range.
  • The center point refers to the position that is assigned the highest attention weight when generating the network representation of the current element.
  • The window size refers to the length of the local enhancement range; it determines over how many elements the attention weights are concentrated.
  • The elements covered by the span centered on the center point with the window size as its width constitute the local enhancement range. Because the local enhancement range corresponding to each element depends on the element itself, it is specific to each element rather than fixed to a single range, so the generated network representation of each element can flexibly capture rich context information.
  • The computer device may determine the local enhancement range corresponding to each element according to the center point and the window size. This step may include: taking the center point as the expectation of a Gaussian distribution and the window size as the variance of the Gaussian distribution;
  • the local enhancement range is then determined by the Gaussian distribution specified by this expectation and variance;
  • the computer device can calculate the connection strength between each pair of elements based on the determined local enhancement range to obtain the local enhancement matrix, where the connection strength between two elements is calculated by the following formula:
    G_ij = -2 (j - P_i)² / D_i²    (2)
  • The local enhancement matrix G is an I × I matrix containing I column vectors, and the dimension of each column vector is I.
  • The value of each element in the i-th column vector of the local enhancement matrix G is determined based on the local enhancement range corresponding to the i-th element in the input sequence.
  • Formula (2) is a function symmetric about the center point P_i; its numerator measures the distance between the j-th element in the input sequence and the center point P_i corresponding to the i-th element. The closer the distance, the larger G_ij, indicating that the j-th element is more strongly connected to the i-th element, and vice versa.
  • Calculating G_ij with formula (2), which is a variant of the Gaussian distribution, is only an example.
  • The center point can likewise be used as the expectation and the window size as the variance of other distributions that have an expectation and a variance, such as a Poisson distribution or a binomial distribution, in order to compute G_ij and obtain the local enhancement matrix G.
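A sketch of formula (2), continuing the NumPy example. The center points P and window sizes D are filled with toy values here; the following sections describe how they are actually predicted from the request (and key) vectors. Note that positions are 0-based in the code, whereas the text counts elements from 1.

```python
def local_enhancement_matrix(P, D, I):
    """Formula (2): G[i, j] = -2 * (j - P[i])**2 / D[i]**2.
    G[i, j] is closer to 0 the nearer position j is to the center point P[i]."""
    j = np.arange(I, dtype=float)
    return -2.0 * (j[None, :] - P[:, None]) ** 2 / (D[:, None] ** 2)

P_toy = np.full(I, (I - 1) / 2.0)            # toy center points (mid-sequence for every element)
D_toy = np.full(I, 3.0)                      # toy window sizes
G = local_enhancement_matrix(P_toy, D_toy, I)   # I x I
```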
  • S210 Perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element.
  • The logical similarity characterizes the similarity between the two elements in each element pair in the input sequence, and the local enhancement matrix characterizes the strength of the connection between the two elements in each element pair; combining the two yields the locally enhanced attention weight distribution.
  • Performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element may include: correcting the logical similarity according to the local enhancement matrix to obtain the locally enhanced logical similarity, and normalizing the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.
  • The computer device can modify the logical similarity using the connection strengths to obtain the locally enhanced logical similarity.
  • Specifically, the logical similarity matrix E, which includes the logical similarities of all element pairs, and the corresponding local enhancement matrix G can be added so that the logical similarity matrix is modified (also called offset); then each logical similarity column vector e_i' in the modified logical similarity matrix is normalized to obtain the locally enhanced attention weight distribution.
  • After normalization, the values lie in (0, 1) and all elements sum to 1. Normalizing the column vector e_i' highlights the largest value and suppresses components far below the maximum, yielding the locally enhanced attention weight distribution corresponding to the i-th element in the input sequence.
  • The locally enhanced attention weight distribution A can be calculated by the following formula:
    A = softmax(E + G)
  • where the softmax function is a normalization function and A is a matrix including the attention weight distributions corresponding to the elements in the input sequence;
  • A = {α_1, α_2, α_3, ..., α_I} includes I I-dimensional column vectors, and the i-th element α_i in A represents the attention weight distribution corresponding to the i-th element x_i in the input sequence.
  • the network representation sequence is a sequence composed of a plurality of network representations (vector representations).
  • An input sequence may be input into a neural network model, and linear or non-linear transformations by the model parameters in the hidden layers of the neural network model output a network representation sequence corresponding to the input sequence.
  • The i-th element o_i in the network representation sequence O corresponding to the input sequence can be calculated by the following formula:
    o_i = Σ_{j=1..I} α_ij · v_j
  • where α_ij is a scalar and v_j is a d-dimensional column vector,
  • so o_i is also a d-dimensional column vector; that is,
  • the network representation o_i corresponding to x_i can be expanded as:
    o_i = α_i1 · v_1 + α_i2 · v_2 + α_i3 · v_3 + ... + α_iI · v_I.
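Putting the pieces together, a sketch of the locally enhanced attention weights and the fusion step, using the E and (toy) G computed above:

```python
def softmax(rows):
    rows = rows - rows.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(rows)
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(E + G)       # locally enhanced attention weight distribution; each row sums to 1
O = A @ V                # network representation sequence: o_i = sum_j A[i, j] * v_j
```

Because G only offsets the logits before the softmax, distant elements keep a nonzero (if small) weight, which is how the method concentrates attention locally while still preserving long-distance connections.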
  • Because the attention weight distribution corresponding to the current element is a locally enhanced attention weight distribution obtained by modifying the original logical similarity, the weighted summation does not uniformly consider the value vectors of all elements in the input sequence; instead, it focuses on the value vectors of the elements within the local enhancement range. In this way, the output network representation of the current element contains the local information associated with the current element.
  • The term "element" in this application describes the basic constituent unit of a sequence, vector, or matrix:
  • the elements of the input sequence are its individual entries;
  • the elements of a matrix are the column vectors that constitute it;
  • and the elements of a column vector are the individual values in it.
  • FIG. 3 is a schematic diagram of a process of calculating a network representation sequence corresponding to an input sequence in an embodiment.
  • Z is linearly transformed into a request vector sequence Q, a key vector sequence K, and a value vector sequence V by three different learnable parameter matrices, and then a dot product operation is performed.
  • FIG. 4 is a system architecture diagram using a Gaussian distribution to modify the SAN attention weight distribution in one embodiment. Take the input sequence "Bush held a talk with Sharon" with current element "Bush" as an example. On the left of FIG. 4, the original SAN serves as the base model to obtain each element pair (formed by taking the elements of the input sequence in pairs); the attention weight distribution corresponding to "Bush", calculated from the logical similarity, takes all words into account, with "held" assigned the highest attention weight (the height of each bar represents the size of the attention weight) and the remaining words assigned lower attention weights. In the middle of FIG. 4,
  • the position of the center point of the local enhancement range corresponding to the current element "Bush" is calculated, using a Gaussian distribution, to be approximately 4, corresponding to the word "talk" in the input sequence, and the window size of the local enhancement range is approximately 3; that is, the local enhancement range corresponding to the current element "Bush" covers the positions of the 3 words centered on the word "talk".
  • The local enhancement matrix is then calculated, and the logical similarity obtained on the left of FIG. 4 is modified with the local enhancement matrix so that the attention weight distribution is concentrated on these three words, with "talk" assigned the highest attention weight. The right of FIG. 4 combines the left of FIG. 4 with the middle of FIG. 4.
  • The aforementioned method for generating a network representation of a neural network constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so that attention weights can be allocated within the local enhancement range and local information is strengthened. By linearly transforming the source-side vector representation sequence, a request vector sequence, a key vector sequence, and a value vector sequence are obtained; a logical similarity is calculated from the request vector sequence and the key vector sequence; a non-linear transformation based on the logical similarity and the local enhancement matrix then yields the locally enhanced attention weight distribution, which modifies the original attention weights; and the weighted summation of the value vector sequence according to the locally enhanced attention weight distribution produces a network representation sequence with enhanced local information, which not only strengthens local information but also preserves the connections between long-distance elements in the input sequence.
  • constructing a local enhancement matrix according to a request vector sequence may include the following steps:
  • The local enhancement range corresponding to each element in the input sequence is determined by the center point and window size corresponding to that element, and the center point corresponding to each element depends on the request vector corresponding to that element. Therefore, the center point of the local enhancement range corresponding to each element can be determined according to its request vector.
  • Determining the center point of the local enhancement range corresponding to each element according to the request vector sequence may include: for each element in the input sequence, transforming the request vector corresponding to the element in the request vector sequence through a first feedforward neural network to obtain the first scalar corresponding to the element; performing a non-linear transformation on the first scalar using a non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and using the second scalar as the center point of the local enhancement range corresponding to the element.
  • the computer device may determine the center point of the local enhancement range corresponding to each element according to the request vector sequence obtained in step S204. Taking the i-th element x i in the input sequence as an example, the center point of the corresponding local enhancement range can be obtained by the following steps:
  • First, the computer device maps the request vector q_i corresponding to the i-th element into a hidden state through the first feedforward neural network and applies a linear transformation to obtain the first scalar p_i corresponding to the i-th element in the input sequence.
  • The first scalar p_i is a value in the real number space, and the calculation formula of p_i is:
    p_i = U_P^T · tanh(W_P · q_i)
  • where tanh(W_P · q_i) is the hidden-state part of the first feedforward neural network, tanh is the activation function, and q_i is the request vector corresponding to the i-th element in the input sequence;
  • W_P is a trainable linear transformation matrix;
  • U_P is a d-dimensional column vector, and U_P^T maps the hidden state to a scalar.
  • Here and below, this application uses a feedforward neural network to map a vector to a hidden state, but this does not limit vector mapping to feedforward networks; the feedforward neural network can be replaced with other neural network models, such as the Long Short-Term Memory (LSTM) model and its variants, gated units and their variants, or simple linear transformations.
  • Second, the computer device converts the first scalar p_i into a scalar with value range (0, 1) through a non-linear transformation function and multiplies it by the input sequence length I to obtain a center position P_i with value range (0, I).
  • P_i is the center point of the local enhancement range corresponding to the i-th element and is proportional to the input sequence length I. P_i can be calculated by the following formula:
    P_i = I · sigmoid(p_i)
  • where sigmoid is a non-linear transformation function that converts p_i into a scalar with value range (0, 1).
  • The sigmoid function can also be replaced by any other method that maps an arbitrary real number to (0, 1); this application is not limited in this respect.
  • The computer device uses the calculated P_i as the center point of the local enhancement range corresponding to the i-th element x_i in the input sequence. For example, if the input sequence length I is 10 and the calculated P_i equals 5, the center point of the local enhancement range corresponding to x_i is the fifth element in the input sequence.
  • When generating the network representation corresponding to x_i, the attention weight assigned to the value vector of the fifth element in the input sequence is then the highest.
  • the computer device may repeat the above steps until the center point of the local enhancement range corresponding to each element is obtained according to each request vector in the request vector sequence.
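A sketch of the center-point prediction, continuing the example; W_P and U_P are random stand-ins for trained parameters, and all I center points are computed at once instead of element by element.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_P = rng.normal(size=(d, d)) / np.sqrt(d)   # trainable hidden-state projection (stand-in)
U_P = rng.normal(size=d) / np.sqrt(d)        # maps each hidden state to a scalar

p = np.tanh(Q @ W_P.T) @ U_P                 # first scalar p_i for every element, shape (I,)
P = I * sigmoid(p)                           # center points P_i, each in (0, I)
```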
  • A corresponding window size can be predicted for each element. The computer device can then determine the window size of the local enhancement range corresponding to each element according to each request vector in the request vector sequence; that is, each request vector corresponds to a window size.
  • Determining the window size of the local enhancement range corresponding to each element according to the request vector sequence may include: for each element in the input sequence, performing a linear transformation on the request vector corresponding to the element in the request vector sequence through a second feedforward neural network to obtain a third scalar corresponding to the element; performing a non-linear transformation on the third scalar using a non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and using the fourth scalar as the window size of the local enhancement range corresponding to the element.
  • the computer device may determine the window size of the local enhancement range corresponding to each element according to the request vector sequence obtained in step S204. Taking the i-th element x i in the input sequence as an example, the window size of the corresponding local enhancement range can be obtained by the following steps:
  • First, the computer device maps the request vector q_i corresponding to the i-th element into a hidden state through the second feedforward neural network and applies a linear transformation to obtain the third scalar z_i corresponding to the i-th element in the input sequence.
  • The third scalar z_i is a value in the real number space, and the calculation formula of z_i is:
    z_i = U_D^T · tanh(W_P · q_i)
  • where tanh(W_P · q_i) is the hidden-state part of the second feedforward neural network, tanh is the activation function, and q_i is the request vector corresponding to the i-th element in the input sequence;
  • W_P is shared with the center point calculation above, so the hidden state tanh(W_P · q_i) is reused;
  • U_D is a d-dimensional column vector, and U_D^T is a d-dimensional row vector that maps the high-dimensional vector output by the feedforward neural network to a scalar.
  • Second, the computer device converts the third scalar z_i into a scalar with value range (0, 1) through a non-linear transformation function and multiplies it by the input sequence length I to obtain a window size D_i with value range (0, I).
  • D_i is the window size of the local enhancement range corresponding to the i-th element and is proportional to the input sequence length I. D_i can be calculated by the following formula:
    D_i = I · sigmoid(z_i)
  • where sigmoid is a non-linear transformation function used to convert z_i into a scalar with value range (0, 1).
  • The computer device uses the calculated D_i as the window size of the local enhancement range corresponding to the i-th element x_i in the input sequence. For example, if the input sequence length I is 10 and the calculated D_i equals 7, the window size of the local enhancement range corresponding to x_i covers the 7 elements centered on the center point. When generating the network representation corresponding to x_i, the attention weights are concentrated on these 7 elements.
  • the computer device may repeat the above steps until the window size of the local enhancement range corresponding to each element is obtained according to each request vector in the request vector sequence.
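The per-element window size follows the same pattern; per the text above, the hidden state tanh(W_P · q_i) is shared with the center-point prediction, and only the scalar projection U_D is new. Continuing the sketch:

```python
U_D = rng.normal(size=d) / np.sqrt(d)        # separate scalar projection for the window size

hidden = np.tanh(Q @ W_P.T)                  # shared hidden state, reused from the center point
z = hidden @ U_D                             # third scalar z_i for every element, shape (I,)
D = I * sigmoid(z)                           # window sizes D_i, each in (0, I)

G = local_enhancement_matrix(P, D, I)        # per-element local enhancement matrix, formula (2)
```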
  • In steps S502 and S504, since the request vectors corresponding to the elements in the input sequence differ, the center point and window size corresponding to each element also differ, and so does the local enhancement range corresponding to each element; this makes the selection of the local enhancement range flexible and adapted to the characteristics of each element.
  • The computer device can then calculate the connection strength between each pair of elements based on the determined local enhancement ranges to obtain the local enhancement matrix, where the connection strength between two elements is calculated by formula (2) above;
  • G_ij is the value of the j-th element of the i-th column vector in the local enhancement matrix G.
  • FIG. 6 is a schematic flowchart of determining a local enhancement range according to a request vector sequence in an embodiment.
  • Referring to FIG. 6, the request vector sequence is first mapped to a hidden state by a feedforward neural network; the hidden state is then mapped to a scalar in the real number space by a linear transformation; the scalar is converted by the non-linear transformation function sigmoid into a scalar with value range (0, 1) and multiplied by the input sequence length I to obtain the center point and the window size, thereby determining the local enhancement range; and the local enhancement matrix is calculated based on the local enhancement range.
  • Constructing the local enhancement matrix according to the request vector sequence may include: determining the center point of the local enhancement range corresponding to each element according to the request vector sequence; determining the window size of a unified local enhancement range according to the key vector sequence; determining the local enhancement range corresponding to each element according to the center point and window size; and calculating the connection strength between each pair of elements based on the local enhancement range to obtain the local enhancement matrix.
  • the manner of determining the local enhancement range corresponding to each element according to the request vector sequence is the same as that described above, and details are not described herein again.
  • the window size of the local enhancement range corresponding to all elements in the input sequence is determined by a uniform window size.
  • the information of all elements in the input sequence needs to be fused.
  • Determining the window size of the unified local enhancement range according to the key vector sequence may include: obtaining the key vectors in the key vector sequence; calculating the average of the key vectors; performing a linear transformation on the average to obtain a fifth scalar; performing a non-linear transformation on the fifth scalar through a non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and using the sixth scalar as the window size of the unified local enhancement range.
  • The computer device can determine the window size of the unified local enhancement range according to the key vector sequence obtained in step S204; that is, the window size of the local enhancement range corresponding to every element is the same.
  • The unified window size can be determined by the following steps:
  • First, the computer device obtains the key vector sequence K corresponding to the input sequence, calculates the average k̄ = (1/I) · Σ_{i=1..I} k_i of all key vectors in K, and linearly transforms the average to obtain the fifth scalar z,
  • where W_D is the trainable linear transformation matrix used for this transformation.
  • Second, the computer device converts the fifth scalar z into a scalar with value range (0, 1) through a non-linear transformation function and multiplies it by the input sequence length I to obtain a window size with value range (0, I).
  • D is the window size of the uniform local enhancement range and is proportional to the input sequence length I. D can be calculated by the following formula:
    D = I · sigmoid(z)
  • where sigmoid is a non-linear transformation function used to convert z into a scalar with value range (0, 1).
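A sketch of the uniform window size, continuing the example. The exact form of mapping the averaged key to the fifth scalar is not spelled out above; the tanh-then-project pattern used here, with a stand-in W_D, is an assumption chosen for consistency with the per-element formulas.

```python
W_D = rng.normal(size=(d, d)) / np.sqrt(d)   # trainable linear transformation (stand-in)

k_bar = K.mean(axis=0)                       # average of all key vectors, shape (d,)
z_uni = U_D @ np.tanh(W_D @ k_bar)           # fifth scalar (assumed form, see lead-in)
D_uni = I * sigmoid(z_uni)                   # one window size shared by every element

G_uniform = local_enhancement_matrix(P, np.full(I, D_uni), I)
```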
  • The computer device can then calculate the connection strength between each pair of elements based on the determined local enhancement ranges to obtain the local enhancement matrix, where the connection strength between two elements is calculated by formula (2) above, with the per-element window size replaced by the uniform window size D;
  • G_ij is the value of the j-th element of the i-th column vector in the local enhancement matrix G.
  • FIG. 7 is a schematic flowchart of determining a local enhancement range according to a request vector sequence and a key vector sequence in an embodiment.
  • Referring to FIG. 7, the request vector sequence is mapped to a hidden state through a feedforward neural network, while the key vector sequence is averaged through an average pooling layer; the hidden state is mapped to a scalar in the real number space by a linear transformation, and the average value is likewise mapped to a scalar in the real number space; each scalar is then transformed by the non-linear transformation function sigmoid into a scalar with value range (0, 1) and multiplied by the input sequence length I to obtain the center point and the window size, thereby determining the local enhancement range.
  • Since the key vector sequence corresponding to the input sequence includes the feature vectors (key vectors) of all elements in the input sequence, the determined uniform window size takes all the context information into account, so the local enhancement range determined for each element based on the uniform window size can capture rich context information.
  • Performing linear transformations on the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to the source-side vector representation sequence may include: dividing the source-side vector representation sequence into multiple sets of low-dimensional source-side vector representation subsequences; and, according to multiple sets of different parameter matrices, performing different linear transformations on each set of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each set of source-side vector representation subsequences. The method further comprises: splicing the network representation subsequences corresponding to the sets of source-side vector representation subsequences and linearly transforming the result to obtain the output network representation sequence.
  • A stacked multi-head neural network can be used to process the source-side vector representation sequence corresponding to the input sequence. The source-side vector representation sequence can be divided to obtain multiple sets (also called multiple heads) of low-dimensional source-side vector representation subsequences.
  • For example, if the source-side vector representation sequence consists of 5 elements, each a 512-dimensional column vector, and it is divided into 8 parts, the result is eight 5 × 64 source-side vector representation subsequences.
  • These eight source-side vector representation subsequences serve as input vectors and are transformed through different subspaces to output eight 5 × 64 network representation subsequences.
  • The eight network representation subsequences are spliced and then linearly transformed to output a 5 × 512 network representation sequence.
  • Suppose the stacked multi-head neural network includes H groups of subspaces.
  • Z = {z_1, z_2, z_3, ..., z_I} is divided into H source-side vector representation subsequences.
  • The h-th subspace uses its corresponding learnable parameter matrices W_h_Q, W_h_K, and W_h_V to linearly transform Z_h = {z_h1, z_h2, z_h3, ..., z_hI} and obtain the corresponding request vector sequence Q_h, key vector sequence K_h, and value vector sequence V_h.
  • The three learnable parameter matrices used in each subspace differ, so each subspace obtains different feature vectors and different subspaces can focus on different local information.
  • Each element G_h,ij of the local enhancement matrix G_h of the h-th subspace is calculated by the same formula: the center point P_hi of the local enhancement range corresponding to the i-th element is determined according to Q_h, and the window size D_hi of the local enhancement range corresponding to the i-th element is determined according to Q_h or K_h. G_h,ij is the value of the j-th element of the i-th column vector in the local enhancement matrix G_h, and represents the connection strength between the j-th element in the input sequence as expressed in the h-th subspace and the center point P_hi corresponding to the i-th element.
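A sketch of the multi-head (multi-subspace) variant, continuing the example: Z is split into H low-dimensional subsequences, each transformed with its own parameter matrices, and the per-head outputs are spliced and linearly transformed. The per-head local enhancement matrix G_h is set to zero here for brevity; it would be built from Q_h (and K_h) exactly as in the earlier sketches. W_O is a stand-in for the final linear transformation.

```python
H = 8
d_h = d // H                                 # per-head dimension, e.g. 512 / 8 = 64

heads = []
for h in range(H):
    Z_h = Z[:, h * d_h:(h + 1) * d_h]        # h-th low-dimensional subsequence, I x d_h
    W_hQ, W_hK, W_hV = (rng.normal(size=(d_h, d_h)) / np.sqrt(d_h) for _ in range(3))
    Q_h, K_h, V_h = Z_h @ W_hQ, Z_h @ W_hK, Z_h @ W_hV
    E_h = (Q_h @ K_h.T) / np.sqrt(d_h)
    G_h = np.zeros((I, I))                   # would be local_enhancement_matrix(P_h, D_h, I)
    heads.append(softmax(E_h + G_h) @ V_h)   # network representation subsequence, I x d_h

W_O = rng.normal(size=(d, d)) / np.sqrt(d)   # splice-and-transform matrix (stand-in)
O_multi = np.concatenate(heads, axis=1) @ W_O   # I x d output network representation sequence
```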
  • The method further includes: after obtaining the network representation sequence corresponding to the input sequence, using the network representation sequence as a new source-side vector representation sequence, returning to the step of linearly transforming the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to the source-side vector representation sequence, and continuing execution until a loop stop condition is reached, at which point the final network representation sequence is output.
  • The neural network can be stacked in multiple layers for calculation; whether it is a single-layer neural network or a stacked multi-head neural network, the calculation can be repeated across multiple layers.
  • That is, the output of the previous layer is used as the input of the next layer, and the linear transformations are repeated to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence, until the output of the current layer is obtained.
  • The number of repetitions can be six, and the network parameters of each layer differ. It can be understood that repeating 6 times actually means that the source-side vector representation sequence of the original input sequence is updated 6 times, once through each layer's network parameters.
  • For example, the output of the first layer is O_L1.
  • O_L1 is used as the input of the second layer and is transformed by the network parameters of the second layer to output the second layer's result.
  • FIG. 8 is a schematic structural diagram of a multi-layer stacked multi-head self-attention neural network according to an embodiment.
  • For each layer, the input is the same: the output of the previous layer.
  • Within a layer, the input is divided into multiple sub-inputs, and the network parameters of each subspace (also called a head) apply their respective transformations to each sub-input to obtain the output of each subspace.
  • The multiple outputs are spliced to obtain the output of the current layer, and the output of the current layer is the input of the next layer. This is repeated multiple times, and the output of the last layer is taken as the final output.
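A sketch of this multi-layer stacking, continuing the example: the output of each layer feeds the next, and each layer has its own parameters. The single-head form, six layers, and the zero G bias are illustrative simplifications.

```python
def attention_layer(X, params):
    """One (single-head) locally enhanced self-attention layer; see earlier sketches."""
    Q, K, V = X @ params["W_Q"], X @ params["W_K"], X @ params["W_V"]
    E = (Q @ K.T) / np.sqrt(X.shape[1])
    G = np.zeros((X.shape[0], X.shape[0]))   # local enhancement bias, omitted for brevity
    return softmax(E + G) @ V

layer_params = [{k: rng.normal(size=(d, d)) / np.sqrt(d)
                 for k in ("W_Q", "W_K", "W_V")} for _ in range(6)]

out = Z
for params in layer_params:                  # e.g. six layers, each with its own parameters
    out = attention_layer(out, params)       # previous layer's output feeds the next layer
```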
  • The input sequence may be a text sequence to be translated,
  • and the output network representation sequence consists of the feature vectors corresponding to each word, so the translated sentence may be determined according to the output network representation sequence.
  • Referring to FIG. 9, which is a schematic flowchart of a method for generating a network representation of a neural network in an embodiment, the method includes the following steps:
  • S902 Obtain a source-side vector representation sequence corresponding to the input sequence.
  • the source-side vector representation sequence is divided into multiple sets of low-dimensional source-side vector representation subsequences.
  • S912 Perform a non-linear transformation on the first scalar by using a non-linear transformation function to obtain a second scalar that is proportional to the length of the input sequence.
  • S9164 Perform a non-linear transformation on the third scalar by using a non-linear transformation function to obtain a fourth scalar that is proportional to the length of the input sequence.
  • S9165 Perform a non-linear transformation on the fifth scalar by using a non-linear transformation function to obtain a sixth scalar that is proportional to the length of the input sequence.
  • S926 Fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence.
  • The aforementioned method for generating a network representation of a neural network constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so that attention weights can be allocated within the local enhancement range and local information is strengthened. By linearly transforming the source-side vector representation sequence, a request vector sequence, a key vector sequence, and a value vector sequence are obtained; a logical similarity is calculated from the request vector sequence and the key vector sequence; a non-linear transformation based on the logical similarity and the local enhancement matrix then yields the locally enhanced attention weight distribution, which modifies the original attention weights; and the weighted summation of the value vector sequence according to the locally enhanced attention weight distribution produces a network representation sequence with enhanced local information, which not only strengthens local information but also preserves the connections between long-distance elements in the input sequence.
  • Although the steps in the flowchart of FIG. 9 are displayed sequentially according to the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 9 may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
  • In one embodiment, a network representation generating apparatus 1000 for a neural network includes an acquisition module 1002, a linear transformation module 1004, a logical similarity calculation module 1006, a local enhancement matrix construction module 1008, an attention weight distribution determination module 1010, and a fusion module 1012, where:
  • An obtaining module 1002 configured to obtain a source-side vector representation sequence corresponding to an input sequence
  • the linear transformation module 1004 is configured to perform linear transformation on the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence;
  • a logical similarity calculation module 1006 configured to calculate a logical similarity between a request vector sequence and a key vector sequence
  • a local enhancement matrix construction module 1008, configured to construct a local enhancement matrix according to a request vector sequence
  • Attention weight distribution determination module 1010 is configured to perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element;
  • a fusion module 1012 is configured to fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence.
  • The local enhancement matrix construction module 1008 is further configured to: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of the local enhancement range corresponding to each element according to the request vector sequence; determine the local enhancement range corresponding to each element according to the center point and window size; and calculate the connection strength between each pair of elements based on the local enhancement range to obtain the local enhancement matrix.
  • The local enhancement matrix construction module 1008 is further configured to: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of a unified local enhancement range according to the key vector sequence; determine the local enhancement range corresponding to each element according to the center point and window size; and calculate the connection strength between each pair of elements based on the local enhancement range to obtain the local enhancement matrix.
  • The local enhancement matrix construction module 1008 is further configured to: for each element in the input sequence, transform the request vector corresponding to the element in the request vector sequence through the first feedforward neural network to obtain the first scalar corresponding to the element; perform a non-linear transformation on the first scalar through a non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and use the second scalar as the center point of the local enhancement range corresponding to the element.
  • The local enhancement matrix construction module 1008 is further configured to: for each element in the input sequence, linearly transform the request vector corresponding to the element in the request vector sequence through the second feedforward neural network to obtain the third scalar corresponding to the element; perform a non-linear transformation on the third scalar through a non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and use the fourth scalar as the window size of the local enhancement range corresponding to the element.
  • The local enhancement matrix construction module 1008 is further configured to: obtain the key vectors in the key vector sequence; calculate the average of the key vectors; linearly transform the average to obtain a fifth scalar; perform a non-linear transformation on the fifth scalar through a non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and use the sixth scalar as the window size of the unified local enhancement range.
  • The local enhancement matrix construction module 1008 is further configured to: use the center point as the expectation of the Gaussian distribution and the window size as the variance of the Gaussian distribution; determine the local enhancement range according to the Gaussian distribution specified by this expectation and variance; and arrange the connection strengths between pairs of elements in the order of the elements in the input sequence to obtain the local enhancement matrix, where the connection strength between two elements is calculated by formula (2) above.
  • the attention weight distribution determination module 1010 is further configured to: modify the logical similarity according to the local enhancement matrix to obtain the locally enhanced logical similarity; perform normalization processing on the locally enhanced logical similarity to obtain the same Distribution of locally enhanced attention weights corresponding to each element.
  • The linear transformation module 1004 is further configured to: divide the source-side vector representation sequence into multiple sets of low-dimensional source-side vector representation subsequences; and, according to multiple sets of different parameter matrices, perform different linear transformations on each set of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each set of source-side vector representation subsequences. The apparatus also includes a splicing module, which splices the network representation subsequences corresponding to the sets of source-side vector representation subsequences and linearly transforms the result to obtain the output network representation sequence.
  • the device 1000 further includes a loop module, configured to: after the network representation sequence corresponding to the input sequence is obtained, use the network representation sequence as a new source-side vector representation sequence and return to the step of linearly transforming the source-side vector representation sequence to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence, continuing execution until a loop stop condition is reached, at which point the final network representation sequence is output.
  • the network representation generating device 1000 of the aforementioned neural network constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within the local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, a request vector sequence, a key vector sequence, and a value vector sequence are obtained; a logical similarity is computed from the request vector sequence and the key vector sequence, and a non-linear transformation based on the logical similarity and the local enhancement matrix then yields the locally enhanced attention weight distribution, which modifies the original attention weights. A weighted summation of the value vector sequence according to the locally enhanced attention weight distribution produces a network representation sequence that not only strengthens local information but also preserves the connections between long-distance elements in the input sequence.
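  • To make the data flow above concrete, the following is a minimal runnable sketch of locally enhanced self-attention in Python with NumPy. It is an illustration under stated assumptions, not the patent's reference implementation: the names (local_attention, W_q, U_p, and so on) are invented for the sketch, rows stand in for the column vectors used in the text, and it instantiates the formulas E = Q·K^T/√d, G_ij = -2(j - P_i)^2/D_i^2, and A = softmax(E + G) described in this document.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(Z, W_q, W_k, W_v, W_p, U_p, U_d):
    """Locally enhanced self-attention over a source-side sequence Z of
    shape (I, d); rows here play the role of the text's column vectors."""
    I, d = Z.shape
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v       # request/key/value sequences
    E = Q @ K.T / np.sqrt(d)                  # logical similarity matrix
    hidden = np.tanh(Q @ W_p)                 # shared hidden state, (I, d)
    P = I * sigmoid(hidden @ U_p)             # center point per element, (I,)
    D = I * sigmoid(hidden @ U_d)             # window size per element, (I,)
    j = np.arange(I)
    G = -2.0 * (j[None, :] - P[:, None]) ** 2 / (D[:, None] ** 2 + 1e-9)
    A = softmax(E + G, axis=-1)               # locally enhanced weights
    return A @ V                              # fused network representations
```

  • For example, calling local_attention on a (6, 64) input with randomly drawn parameter matrices returns a (6, 64) network representation sequence in which each row is a locally weighted fusion of the value vectors.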
  • FIG. 11 shows an internal structure diagram of the computer device 120 in one embodiment.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store a computer program; when that computer program is executed by the processor, the processor is caused to implement the network representation generation method of a neural network.
  • a computer program may also be stored in the internal memory; when this computer program is executed by the processor, the processor is caused to execute the network representation generation method of a neural network.
  • the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; an actual computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the network representation generating device 1000 of the neural network provided in the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in FIG. 11.
  • the memory of the computer device may store the program modules constituting the network representation generating device 1000 of the neural network, for example, the acquisition module 1002, the linear transformation module 1004, the logical similarity calculation module 1006, the local enhancement matrix construction module 1008, the attention weight distribution determination module 1010, and the fusion module 1012 shown in FIG. 10.
  • the computer program constituted by each program module causes the processor to execute the steps in the method for generating a network representation of a neural network in each embodiment of the present application described in this specification.
  • the computer device shown in FIG. 11 may perform step S202 through the acquisition module 1002 in the network representation generating device of the neural network shown in FIG. 10.
  • the computer device may perform step S204 through the linear transformation module 1004.
  • the computer device may execute step S206 through the logical similarity calculation module 1006.
  • the computer device may execute step S208 through the local enhancement matrix construction module 1008.
  • the computer device may perform step S210 through the attention weight distribution determination module 1010.
  • the computer device may execute step S212 through the fusion module 1012.
  • a computer device is provided, including a memory and a processor; the memory stores a computer program that, when executed by the processor, causes the processor to perform the following steps: obtain a source-side vector representation sequence corresponding to the input sequence; linearly transform the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it; calculate the logical similarity between the request vector sequence and the key vector sequence; construct a local enhancement matrix according to the request vector sequence; perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element; and, according to the attention weight distribution, fuse the value vectors in the value vector sequence to obtain the network representation sequence corresponding to the input sequence.
  • when the computer program is executed by the processor to perform the step of constructing the local enhancement matrix according to the request vector sequence, the processor is caused to perform the following steps: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of the local enhancement range corresponding to each element according to the request vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.
  • when the computer program is executed by the processor to perform the step of constructing the local enhancement matrix according to the request vector sequence, the processor is caused to perform the following steps: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of a unified local enhancement range according to the key vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.
  • when the computer program is executed by the processor to perform the step of determining the center point of the local enhancement range corresponding to each element according to the request vector sequence, the processor is caused to perform the following steps: for each element in the input sequence, transform the request vector corresponding to that element in the request vector sequence through the first feedforward neural network to obtain the first scalar corresponding to the element; perform a non-linear transformation on the first scalar through the non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and take the second scalar as the center point of the local enhancement range corresponding to the element.
  • when the computer program is executed by the processor to perform the step of determining the window size of the local enhancement range corresponding to each element according to the request vector sequence, the processor is caused to perform the following steps: for each element in the input sequence, linearly transform the request vector corresponding to that element in the request vector sequence through the second feedforward neural network to obtain the third scalar corresponding to the element; perform a non-linear transformation on the third scalar through the non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and take the fourth scalar as the window size of the local enhancement range corresponding to the element.
  • when the computer program is executed by the processor to perform the step of determining the window size of the unified local enhancement range according to the key vector sequence, the processor is caused to perform the following steps: obtain each key vector in the key vector sequence; compute the average of the key vectors; linearly transform the average to obtain a fifth scalar; perform a non-linear transformation on the fifth scalar through the non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and take the sixth scalar as the window size of the unified local enhancement range.
  • when the computer program is executed by the processor to perform the step of determining the local enhancement range corresponding to each element according to the center point and the window size, the processor is caused to perform the following steps: take the center point as the expectation of a Gaussian distribution and the window size as its variance, and determine the local enhancement range according to the Gaussian distribution defined by that mean and variance; and when the computer program is executed by the processor to perform the step of computing the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix, the processor is caused to perform the following steps: arrange the association strengths between every two elements in the order of the elements in the input sequence to obtain the local enhancement matrix, where the association strength between two elements is computed by the following formula: G_ij = -2(j - P_i)^2 / D_i^2, in which G_ij, the value of the j-th element of the i-th column vector of the local enhancement matrix G, denotes the strength of the association between the j-th element of the input sequence and the center point P_i corresponding to the i-th element, P_i denotes the center point of the local enhancement range corresponding to the i-th element, and D_i denotes the window size of that range.
  • when the computer program is executed by the processor to perform the step of performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element, the processor is caused to perform the following steps: modify the logical similarity according to the local enhancement matrix to obtain the locally enhanced logical similarity; and normalize the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.
  • when the computer program is executed by the processor to perform the step of linearly transforming the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it, the processor is caused to perform the following steps: divide the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences; and, according to multiple groups of different parameter matrices, apply a different linear transformation to each group of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each group. When the computer program is executed by the processor, the processor is further caused to perform the following step: splice the network representation subsequences corresponding to the groups of source-side vector representation subsequences and then linearly transform the result to obtain the output network representation sequence.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: after the network representation sequence corresponding to the input sequence is obtained, use the network representation sequence as a new source-side vector representation sequence and return to the step of linearly transforming the source-side vector representation sequence to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence, continuing execution until a loop stop condition is reached, at which point the final network representation sequence is output.
  • the above computer device constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within the local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, a request vector sequence, a key vector sequence, and a value vector sequence are obtained; a logical similarity is computed from the request vector sequence and the key vector sequence, and a non-linear transformation based on the logical similarity and the local enhancement matrix then yields the locally enhanced attention weight distribution, which modifies the original attention weights. A weighted summation of the value vector sequence according to the locally enhanced attention weight distribution produces a network representation sequence that not only strengthens local information but also preserves the connections between long-distance elements in the input sequence.
  • a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the processor is caused to perform the following steps: obtain a source-side vector representation sequence corresponding to the input sequence; linearly transform the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it; calculate the logical similarity between the request vector sequence and the key vector sequence; construct a local enhancement matrix according to the request vector sequence; perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element; and, according to the attention weight distribution, fuse the value vectors in the value vector sequence to obtain the network representation sequence corresponding to the input sequence.
  • when the computer program is executed by the processor to perform the step of constructing the local enhancement matrix according to the request vector sequence, the processor is caused to perform the following steps: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of the local enhancement range corresponding to each element according to the request vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.
  • when the computer program is executed by the processor to perform the step of constructing the local enhancement matrix according to the request vector sequence, the processor is caused to perform the following steps: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of a unified local enhancement range according to the key vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.
  • when the computer program is executed by the processor to perform the step of determining the center point of the local enhancement range corresponding to each element according to the request vector sequence, the processor is caused to perform the following steps: for each element in the input sequence, transform the request vector corresponding to that element in the request vector sequence through the first feedforward neural network to obtain the first scalar corresponding to the element; perform a non-linear transformation on the first scalar through the non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and take the second scalar as the center point of the local enhancement range corresponding to the element.
  • when the computer program is executed by the processor to perform the step of determining the window size of the local enhancement range corresponding to each element according to the request vector sequence, the processor is caused to perform the following steps: for each element in the input sequence, linearly transform the request vector corresponding to that element in the request vector sequence through the second feedforward neural network to obtain the third scalar corresponding to the element; perform a non-linear transformation on the third scalar through the non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and take the fourth scalar as the window size of the local enhancement range corresponding to the element.
  • when the computer program is executed by the processor to perform the step of determining the window size of the unified local enhancement range according to the key vector sequence, the processor is caused to perform the following steps: obtain each key vector in the key vector sequence; compute the average of the key vectors; linearly transform the average to obtain a fifth scalar; perform a non-linear transformation on the fifth scalar through the non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and take the sixth scalar as the window size of the unified local enhancement range.
  • when the computer program is executed by the processor to perform the step of determining the local enhancement range corresponding to each element according to the center point and the window size, the processor is caused to perform the following steps: take the center point as the expectation of a Gaussian distribution and the window size as its variance, and determine the local enhancement range according to the Gaussian distribution defined by that mean and variance; and when the computer program is executed by the processor to perform the step of computing the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix, the processor is caused to perform the following steps: arrange the association strengths between every two elements in the order of the elements in the input sequence to obtain the local enhancement matrix, where the association strength between two elements is computed by the following formula: G_ij = -2(j - P_i)^2 / D_i^2, in which G_ij, the value of the j-th element of the i-th column vector of the local enhancement matrix G, denotes the strength of the association between the j-th element of the input sequence and the center point P_i corresponding to the i-th element, P_i denotes the center point of the local enhancement range corresponding to the i-th element, and D_i denotes the window size of that range.
  • when the computer program is executed by the processor to perform the step of performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element, the processor is caused to perform the following steps: modify the logical similarity according to the local enhancement matrix to obtain the locally enhanced logical similarity; and normalize the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.
  • when the computer program is executed by the processor to perform the step of linearly transforming the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it, the processor is caused to perform the following steps: divide the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences; and, according to multiple groups of different parameter matrices, apply a different linear transformation to each group of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each group. When the computer program is executed by the processor, the processor is further caused to perform the following step: splice the network representation subsequences corresponding to the groups of source-side vector representation subsequences and then linearly transform the result to obtain the output network representation sequence.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: after the network representation sequence corresponding to the input sequence is obtained, use the network representation sequence as a new source-side vector representation sequence and return to the step of linearly transforming the source-side vector representation sequence to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence, continuing execution until a loop stop condition is reached, at which point the final network representation sequence is output.
  • the computer-readable storage medium described above constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within the local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, a request vector sequence, a key vector sequence, and a value vector sequence are obtained; a logical similarity is computed from the request vector sequence and the key vector sequence, and a non-linear transformation based on the logical similarity and the local enhancement matrix then yields the locally enhanced attention weight distribution, which modifies the original attention weights. A weighted summation of the value vector sequence according to the locally enhanced attention weight distribution produces a network representation sequence that not only strengthens local information but also preserves the connections between long-distance elements in the input sequence.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a method, apparatus, storage medium, and device for generating a network representation of a neural network. The method includes: obtaining a source-side vector representation sequence corresponding to an input sequence; linearly transforming the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to it; calculating the logical similarity between the request vector sequence and the key vector sequence; constructing a local enhancement matrix according to the request vector sequence; performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element; and fusing the value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence. The network representation sequence generated by the solution provided in this application not only strengthens local information but also preserves the connections between long-distance elements in the input sequence.

Description

Method, apparatus, storage medium, and device for generating a network representation of a neural network

This application claims priority to Chinese Patent Application No. 201811027795.X, entitled "Method, apparatus, storage medium, and device for generating a network representation of a neural network" and filed on September 4, 2018, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of computer technology, and in particular to a method, an apparatus, a storage medium, and a device for generating a network representation of a neural network.

Background

The attention mechanism is a method for modeling the dependencies between the hidden states of the encoder and the decoder in a neural network; it is widely applied in the various tasks of natural language processing (NLP) based on deep learning.

A self-attention network (SAN) is a neural network model based on the self-attention mechanism and is one kind of attention model. It can compute an attention weight for every element pair in an input sequence, so long-distance dependencies can be captured and the network representation of each element is not affected by the distances between elements. However, a SAN considers every element of the input sequence in full, so attention weights between each element and all elements must be computed, which disperses the weight distribution to some extent and thereby weakens the connections between elements.

Summary

On this basis, it is necessary to provide a method, an apparatus, a storage medium, and a device for generating a network representation of a neural network, to solve the technical problem that an existing self-attention network weakens the connections between elements because it considers the attention weights between each element and all elements.
In one aspect, a method for generating a network representation of a neural network is provided, for use in a computer device, the method including:

obtaining a source-side vector representation sequence corresponding to an input sequence;

linearly transforming the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence;

calculating the logical similarity between the request vector sequence and the key vector sequence;

constructing a local enhancement matrix according to the request vector sequence;

performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element; and

fusing the value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence.
In another aspect, an apparatus for generating a network representation of a neural network is provided, the apparatus including:

an acquisition module, configured to obtain a source-side vector representation sequence corresponding to an input sequence;

a linear transformation module, configured to linearly transform the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence;

a logical similarity calculation module, configured to calculate the logical similarity between the request vector sequence and the key vector sequence;

a local enhancement matrix construction module, configured to construct a local enhancement matrix according to the request vector sequence;

an attention weight distribution determination module, configured to perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element; and

a fusion module, configured to fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence.
In yet another aspect, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of the above method for generating a network representation of a neural network.

In still another aspect, a computer device is provided, including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the above method for generating a network representation of a neural network.

The above method, apparatus, storage medium, and device construct a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within a local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, a request vector sequence, a key vector sequence, and a value vector sequence are obtained; a logical similarity can be computed from the request vector sequence and the key vector sequence, and a non-linear transformation based on the logical similarity and the local enhancement matrix yields a locally enhanced attention weight distribution, which modifies the original attention weights. A weighted summation of the value vector sequence according to the locally enhanced attention weight distribution then produces a network representation sequence with strengthened local information that also preserves the connections between long-distance elements in the input sequence.
Brief Description of the Drawings

FIG. 1 is a diagram of the application environment of a method for generating a network representation of a neural network in one embodiment;

FIG. 2 is a schematic flowchart of a method for generating a network representation of a neural network in one embodiment;

FIG. 3 is a schematic diagram of the process of computing the network representation sequence corresponding to an input sequence in one embodiment;

FIG. 4 is a system architecture diagram of using a Gaussian distribution to modify the SAN attention weight distribution in one embodiment;

FIG. 5 is a schematic flowchart of constructing a local enhancement matrix according to the request vector sequence in one embodiment;

FIG. 6 is a schematic flowchart of determining the local enhancement range according to the request vector sequence in one embodiment;

FIG. 7 is a schematic flowchart of determining the local enhancement range according to the request vector sequence and the key vector sequence in one embodiment;

FIG. 8 is a schematic structural diagram of a multi-layer stacked multi-head self-attention neural network in one embodiment;

FIG. 9 is a schematic flowchart of a method for generating a network representation of a neural network in one embodiment;

FIG. 10 is a structural block diagram of an apparatus for generating a network representation of a neural network in one embodiment;

FIG. 11 is a structural block diagram of a computer device in one embodiment.
Detailed Description

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are only used to explain the present application and are not intended to limit it.

FIG. 1 is a diagram of the application environment of the method for generating a network representation of a neural network in one embodiment. Referring to FIG. 1, the method is applied in a system for generating network representations of a neural network, which includes a terminal 110 and a computer device 120. The terminal 110 and the computer device 120 are connected via Bluetooth, USB (Universal Serial Bus), or a network; the terminal 110 can send an input sequence to be processed to the computer device 120, either in real time or not, and the computer device 120 receives the input sequence, transforms it, and outputs the corresponding network representation sequence. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The computer device 120 may be an independent server or terminal, a server cluster composed of multiple servers, or a cloud server providing basic cloud computing services such as cloud servers, cloud databases, cloud storage, and CDNs.

It should be noted that the above application environment is only an example. In some embodiments, the computer device 120 can obtain the input sequence directly, without going through the terminal 110. For example, when the computer device is a mobile phone, the phone can directly obtain an input sequence (for example, the sequence formed by the words of an instant text message) and then transform it with the on-device apparatus for generating network representations of a neural network, outputting the network representation sequence corresponding to the input sequence.

As shown in FIG. 2, in one embodiment, a method for generating a network representation of a neural network is provided. This embodiment is mainly described by applying the method to the computer device 120 in FIG. 1 above. Referring to FIG. 2, the method may include the following steps:
S202: Obtain a source-side vector representation sequence corresponding to the input sequence.

The input sequence is the sequence that is to be transformed into a corresponding network representation sequence. It contains an ordered set of elements; taking an input sequence of I elements as an example, the input sequence can be written as X = {x_1, x_2, x_3, ..., x_I}, its length is I, and I is a positive integer.

In a scenario where the input sequence needs to be translated, the input sequence may be the word sequence corresponding to the text to be translated, and each element of the input sequence is a word of that sequence. If the text to be translated is Chinese, the word sequence may be the sequence obtained by segmenting the text into words and arranging the words in order; if the text is English, the word sequence is the words arranged in order. For example, if the text to be translated is "Bush held a talk with Sharon", the corresponding input sequence X is {Bush, held, a, talk, with, Sharon}.

The source-side vector representation sequence is the sequence formed by the source-side vector representations of all elements of the input sequence. Each vector representation corresponds one-to-one with an element of the input sequence, and the sequence can be written as Z = {z_1, z_2, z_3, ..., z_I}.

The computer device can convert each element of the input sequence into a fixed-length vector (that is, a word embedding). In one embodiment, when the method is applied in a neural network model, the computer device converts each element of the input sequence into a corresponding vector through the first layer of the model; for example, the i-th element x_i is converted into a d-dimensional column vector z_i, and the vectors corresponding to the elements of the input sequence are combined to obtain the source-side vector representation sequence, i.e., a sequence of I d-dimensional column vectors, where d is a positive integer. Of course, the computer device may also receive a source-side vector representation sequence corresponding to an input sequence sent by another device. z_i and the column vectors mentioned below could equally be row vectors; for ease of explaining the computations, column vectors are used uniformly herein.
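As a small illustration of this vectorization step, the sketch below (Python with NumPy) builds a source-side representation sequence from token indices with a randomly initialized embedding table; the tiny vocabulary, the table, and its initialization are assumptions for illustration only, and rows are used in place of the column vectors described above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"Bush": 0, "held": 1, "a": 2, "talk": 3, "with": 4, "Sharon": 5}
d = 8                                          # toy dimension (e.g. 512 in practice)
embed = rng.normal(size=(len(vocab), d))       # learned word-embedding table

X = ["Bush", "held", "a", "talk", "with", "Sharon"]   # input sequence
Z = np.stack([embed[vocab[w]] for w in X])            # (I, d) source-side sequence
```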
S204: Linearly transform the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to it.

A linear transformation can map vectors belonging to one vector space into another vector space, a vector space being a set of vectors of the same dimension. In one embodiment, the computer device can use three different learnable parameter matrices to linearly transform the source-side vector representation sequence, mapping it into three different vector spaces to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it.

In one embodiment, when the method is applied in a SAN (self-attention network) based model, the request, key, and value vector sequences are all obtained by linearly transforming the source-side vector representation sequence corresponding to the source-side input sequence. In another embodiment, when the method is applied in a neural network model with an Encoder-Decoder structure, the key and value vector sequences are obtained by the encoder encoding the source-side vector representation sequence (i.e., they are encoder outputs), while the request vector sequence is the decoder input, for example a target-side vector representation sequence, which may consist of the vector representations of the elements of the output sequence produced by the decoder.

In one embodiment, the computer device can linearly transform the source-side vector representation sequence Z with three different learnable parameter matrices W^Q, W^K, and W^V to obtain the request vector sequence Q, key vector sequence K, and value vector sequence V by:

Q = Z·W^Q;

K = Z·W^K;

V = Z·W^V;

where the input sequence X = {x_1, x_2, x_3, ..., x_I} includes I elements; each element of the source-side vector representation sequence Z = {z_1, z_2, z_3, ..., z_I} is a d-dimensional column vector, i.e., Z is a sequence of I d-dimensional column vectors and can be written as an I×d matrix; the learnable parameter matrices W^Q, W^K, and W^V are d×d matrices; and the request vector sequence Q, key vector sequence K, and value vector sequence V are I×d matrices.
S206: Calculate the logical similarity between the request vector sequence and the key vector sequence.

The logical similarity measures the similarity between each element of the input sequence and the other elements of that sequence. When the network representation of each element is generated, attention weights can be allocated to the value vectors of the other elements of the input sequence based on this similarity, so that the output network representation of each element takes into account the connections between that element and the others, expresses the features of each element more accurately, and covers richer information.

In one embodiment, when the method is applied in a neural network model with an Encoder-Decoder structure, the request vector sequence is the target-side vector representation sequence, and the computed logical similarity expresses the similarity between the target-side vector representation sequence and the key vector sequence corresponding to the input sequence; allocating attention weights to the value vector sequence of the input sequence based on this similarity lets the network representation of each element output at the source side take into account the influence of the target-side vector representation sequence input at the target side.

In one embodiment, the computer device can compute the logical similarity matrix E between the request vector sequence Q and the key vector sequence K with a cosine-similarity-style formula:

E = (Q·K^T)/√d;

where K^T is the transpose of the key vector sequence K, and d is the dimension into which each element x_i of the input sequence is converted as the source-side representation z_i; d is also the dimension of the network representation corresponding to x_i and of the network hidden state vectors. Dividing by √d in the above formula serves to scale down the inner products.

The computation of the logical similarity matrix E is as follows: Q = (q_1, q_2, ..., q_i, ..., q_I) and K = (k_1, k_2, ..., k_i, ..., k_I), where q_i and k_i are d-dimensional column vectors, namely the request vector and key vector corresponding to the source-side representation z_i. In the logical similarity matrix E = (e_1, e_2, ..., e_i, ..., e_I), the elements of e_i are the logical similarities between the request vector q_i corresponding to z_i and the key vectors k_1, k_2, ..., k_i, ..., k_I corresponding to all elements of the input sequence; e_i is the i-th column of E and is an I-dimensional column vector, computed as

e_i = (K^T·q_i)/√d.

In essence, e_i encodes the connections between the two elements of each of the I element pairs formed by the i-th element x_i and all elements x_1, x_2, ..., x_i, ..., x_I of the input sequence. The logical similarity matrix E is an I×I matrix whose entry in the j-th row of the i-th column is the logical similarity between q_i and k_j.
S208: Construct a local enhancement matrix according to the request vector sequence.

Each element of a column vector of the local enhancement matrix represents the strength of the association between two elements of the input sequence. When the network representation of an element of the input sequence is generated, the local enhancement matrix can strengthen the influence on that representation of the elements that are strongly associated with the current element and relatively weaken the influence of the elements that are weakly associated with it. The local enhancement matrix restricts the scope considered, when weighing the influence of other elements on the current element's representation, to local elements rather than all elements of the input sequence. In this way the allocation of attention weights is biased toward the local elements: the attention weight allocated to the value vector of an element among the local elements is related to the strength of the association between that element and the current element, and the value vectors of elements strongly associated with the current element are allocated larger attention weights.

Take the input sequence "Bush held a talk with Sharon" as an example. In a SAN model, when the network representation corresponding to "Bush" is output, the value vectors of all elements "Bush", "held", "a", "talk", "with", and "Sharon" are considered in full, and attention weights are allocated to the value vectors of all elements, which disperses the attention weight distribution to some extent and thereby weakens the connections between "Bush" and its neighboring elements.

With the method of this embodiment, when the network representation corresponding to "Bush" is output, the attention weights can be allocated within a local enhancement range. If the association between the element "Bush" and the element "held" is strong, a high attention weight is allocated to the value vector of "held"; then "a talk", which like "held" belongs to the local elements within the local enhancement range corresponding to "Bush", is also noticed and thereby allocated higher attention weights. In this way the information (value vectors) of the words of the phrase "held a talk" is captured and associated with "Bush", so that the output network representation of "Bush" can express local information while still retaining dependency relationships with distant elements.

Therefore, when generating the network representation of each element, the computer device needs to determine the local enhancement range corresponding to the current element, so that the allocation of the attention weights for that element is restricted within the local enhancement range.

In one embodiment, the local enhancement range can be determined by two variables: the center point of the local enhancement range and its window size. The center point is the position, in the input sequence, of the element that is allocated the highest attention weight when the current element's network representation is generated; the window size is the length of the local enhancement range and determines over how many elements the attention weights are concentrated. The elements covered when taking the center point as center and the window size as span form the local enhancement range. Since the local enhancement range of each element is related to the element itself and corresponds to that element, rather than being fixed to some range, the generated network representation of each element can flexibly capture rich context information.

In one embodiment, the computer device can determine the local enhancement range of each element from the center point and window size; this step may include: taking the center point as the expectation of a Gaussian distribution and the window size as its variance, and determining the local enhancement range according to the Gaussian distribution defined by that mean and variance. The computer device can then compute the strength of the association between every two elements based on the determined local enhancement range to obtain the local enhancement matrix, where the association strength between two elements is computed by:

G_ij = -2(j - P_i)^2 / D_i^2;      (2)

where G_ij denotes the strength of the association between the j-th element of the input sequence and the center point P_i corresponding to the i-th element, and is the value of the j-th element of the i-th column vector of the local enhancement matrix G; P_i denotes the center point of the local enhancement range corresponding to the i-th element; and D_i denotes the window size of the local enhancement range corresponding to the i-th element.

According to formula (2), the local enhancement matrix G is an I×I matrix of I column vectors, each of dimension I. The values of the elements of the i-th column vector of G are determined by the local enhancement range corresponding to the i-th element of the input sequence. Formula (2) is symmetric about the center point P_i, and the numerator represents the distance between the j-th element of the input sequence and the center point P_i corresponding to the i-th element: the closer the distance, the larger G_ij, indicating a stronger association between the j-th and i-th elements; conversely, the farther the distance, the smaller G_ij, indicating a weaker association. That is, when the network representation of the i-th element is generated, the attention weights are concentrated among the elements close to the center point P_i.

It should be noted that computing G_ij with formula (2), which is derived from the Gaussian distribution, is only an example. In some embodiments, after the center point and window size of the local enhancement range are determined, the center point can be taken as the expectation and the window size as the variance of some other distribution that has an expectation and a variance, such as the Poisson distribution or the binomial distribution, to compute the values of G_ij and thereby obtain the local enhancement matrix G.
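As a quick numerical check of formula (2), the sketch below builds one row of G for the six-word example above, assuming the center point P_i ≈ 4 ("talk") and the window size D_i ≈ 3 from FIG. 4; the numbers are purely illustrative.

```python
import numpy as np

I = 6                     # "Bush held a talk with Sharon"
P_i, D_i = 4.0, 3.0       # assumed center point and window size, as in FIG. 4
j = np.arange(1, I + 1)   # 1-based positions, matching the formulas
G_i = -2.0 * (j - P_i) ** 2 / D_i ** 2
print(np.round(G_i, 2))   # [-2.   -0.89 -0.22  0.   -0.22 -0.89]
```

Positions near the center keep offsets near 0 and therefore keep most of their weight after softmax(E + G), while distant positions receive large negative offsets and are suppressed, matching the behavior illustrated in FIG. 4.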
S210: Perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element.

The logical similarity characterizes the similarity of the two elements in each element pair of the input sequence, and the local enhancement matrix characterizes the strength of the association of the two elements in each element pair; combined, the two can be used to compute the locally enhanced attention weight distribution.

In one embodiment, performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element may include: modifying the logical similarity according to the local enhancement matrix to obtain a locally enhanced logical similarity, and normalizing the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.

After the computer device obtains the logical similarity and the association strength of the two elements in each element pair of the input sequence, it can modify the logical similarity with the association strength to obtain the locally enhanced logical similarity. In one embodiment, the logical similarity matrix E, which includes the logical similarities corresponding to all element pairs, can be added to the local enhancement matrix G, which includes the association strengths corresponding to all element pairs, to modify (also called offset) the logical similarity matrix; the logical similarity vectors of the modified matrix are then normalized to obtain the locally enhanced attention weight distribution.

Normalizing the logical similarity vectors of the modified logical similarity matrix E′ is done per column vector e_i′, so that the value of every element of e_i′ lies in (0, 1) and all elements sum to 1. Normalizing e_i′ highlights the largest value within it and suppresses the components far below the maximum, yielding the locally enhanced attention weight distribution corresponding to the i-th element of the input sequence.

In one embodiment, the locally enhanced attention weight distribution A can be computed by:

A = softmax(E + G);

where the softmax function is the normalization function and A is the matrix that includes the attention weight distribution corresponding to each element of the input sequence; A = {α_1, α_2, α_3, ..., α_I} consists of I I-dimensional column vectors, and the i-th element α_i of A is the attention weight distribution corresponding to the i-th element x_i of the input sequence.
S212: Fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain the network representation sequence corresponding to the input sequence.

The network representation sequence is a sequence of multiple network representations (vector representations). In this embodiment, the input sequence can be fed into a neural network model, and after the linear or non-linear transformations of the model parameters in the hidden layers of the model, the network representation sequence corresponding to the input sequence is output.

When the network representation corresponding to the current element x_i is output, the computer device obtains the attention weight distribution α_i corresponding to that element from the locally enhanced attention weight distribution matrix, takes each element of α_i as a weighting coefficient, and computes a weighted sum of the value vectors in the value vector sequence to obtain the network representation o_i corresponding to x_i. The network representation sequence O corresponding to the input sequence is then composed of multiple network representations, e.g., O = {o_1, o_2, o_3, ..., o_I}.

The i-th element o_i of the network representation sequence O corresponding to the input sequence can be computed by:

o_i = Σ_{j=1}^{I} α_ij·v_j;

since each α_ij is a constant and v_j is a d-dimensional column vector, o_i is also a d-dimensional column vector. That is, when the attention weight distribution corresponding to the i-th element x_i of the input sequence is α_i = {α_i1, α_i2, α_i3, ..., α_iI} and the value vector sequence corresponding to the input sequence is V = {v_1, v_2, v_3, ..., v_I}, the network representation o_i corresponding to x_i can be computed by:

o_i = α_i1·v_1 + α_i2·v_2 + α_i3·v_3 + ... + α_iI·v_I.

Because the attention weight distribution corresponding to the current element is the locally enhanced distribution obtained by modifying the original logical similarity, the weighted sum does not consider the value vectors of all elements of the input sequence in full; it focuses on the value vectors of the elements belonging to the local enhancement range, so the output network representation of the current element contains the local information associated with the current element.

It should be noted that the term "element" as used in this application may be used herein to describe the basic constituent unit of a vector (including a column vector or a matrix vector); for example, "an element of the input sequence" refers to an individual input, "an element of a matrix" refers to one of the column vectors composing the matrix, and "an element of a column vector" refers to one of the values in the column vector. That is, "element" refers to the basic constituent unit of a sequence, vector, or matrix.

FIG. 3 is a schematic diagram of the process of computing the network representation sequence corresponding to an input sequence in one embodiment. Referring to FIG. 3, after the vectorized representation Z corresponding to the input sequence X is obtained, Z is linearly transformed by three different learnable parameter matrices into the request vector sequence Q, key vector sequence K, and value vector sequence V; the logical similarity between each key-value pair is then computed by a dot-product operation to obtain the logical similarity matrix E; a local enhancement matrix G is constructed according to Q or K, and E is modified with G to obtain the locally enhanced logical similarity matrix E′; E′ is normalized with the softmax function to obtain the locally enhanced attention weight distribution matrix A; and finally a dot-product operation on A and the value vector sequence V outputs the network representation sequence O.
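Putting the steps of FIG. 3 together, here is a minimal usage sketch of the local_attention function from the earlier snippet, with randomly drawn parameters that are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
I, d = 6, 64
Z = rng.normal(size=(I, d))                      # source-side sequence from S202
params = {name: 0.1 * rng.normal(size=(d, d))    # toy learnable matrices
          for name in ("W_q", "W_k", "W_v", "W_p")}
params["U_p"] = 0.1 * rng.normal(size=d)
params["U_d"] = 0.1 * rng.normal(size=d)

O = local_attention(Z, **params)                 # network representation sequence
print(O.shape)                                   # (6, 64): one row per element
```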
FIG. 4 is a system architecture diagram of using a Gaussian distribution to modify the SAN attention weight distribution in one embodiment, illustrated with the input sequence "Bush held a talk with Sharon" and the current element "Bush". On the left of FIG. 4, the original SAN is used to build the base model and obtain the logical similarity of each element pair (formed by every two elements of the input sequence); the attention weight distribution for "Bush" computed from this logical similarity considers all words, with "held" allocated the highest attention weight (bar height represents the magnitude of the weight) and the remaining words allocated lower weights. In the middle of FIG. 4, a Gaussian distribution is used to compute the center point of the local enhancement range corresponding to the current element "Bush"; its position is approximately 4, corresponding to the word "talk" in the input sequence, and the window size is approximately 3, so the local enhancement range corresponding to "Bush" is the positions of the 3 words centered on "talk". A local enhancement matrix is computed based on the determined local enhancement range and used to modify the logical similarity obtained on the left of FIG. 4, so that after modification the allocation of attention weights is concentrated among these 3 words, with "talk" allocated the highest weight. Combining the left and middle of FIG. 4 gives, on the right of FIG. 4, the modified attention weight distribution corresponding to the current element "Bush": the phrase "held a talk" receives most of the attention weight, so when the network representation corresponding to "Bush" is computed, the value vectors corresponding to the three words of "held a talk" are considered with emphasis, and the information of "held a talk" is captured and associated with "Bush".

The above method for generating a network representation of a neural network constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within the local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, the request, key, and value vector sequences are obtained; the logical similarity can be computed from the request and key vector sequences, and a non-linear transformation based on the logical similarity and the local enhancement matrix yields the locally enhanced attention weight distribution, modifying the original attention weights. A weighted sum of the value vector sequence according to the locally enhanced attention weight distribution then yields a network representation sequence with strengthened local information, which not only strengthens local information but also preserves the connections between long-distance elements of the input sequence.
As shown in FIG. 5, in one embodiment, constructing the local enhancement matrix according to the request vector sequence may include the following steps:

S502: Determine the center point of the local enhancement range corresponding to each element according to the request vector sequence.

The local enhancement range corresponding to each element of the input sequence is determined by the center point and window size corresponding to that element, and the center point of each element depends on the request vector corresponding to that element; therefore, the center point of the local enhancement range corresponding to each element can be determined from the request vector.

In one embodiment, determining the center point of the local enhancement range corresponding to each element according to the request vector sequence may include: for each element of the input sequence, transforming the request vector corresponding to that element in the request vector sequence through a first feedforward neural network to obtain a first scalar corresponding to the element; performing a non-linear transformation on the first scalar through a non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and taking the second scalar as the center point of the local enhancement range corresponding to the element.

The computer device can determine the center points of the local enhancement ranges corresponding to the elements from the request vector sequence obtained in step S204. Taking the i-th element x_i of the input sequence as an example, the center point of its local enhancement range can be obtained through the following steps:

1) The computer device maps the request vector q_i corresponding to the i-th element to a hidden state through the first feedforward neural network and linearly transforms the hidden state with U_P^T to obtain the first scalar p_i corresponding to the i-th element; the first scalar p_i is a value in real-number space, computed as

p_i = U_P^T·tanh(W_P·q_i);

where tanh(W_P·q_i) is part of the first feedforward neural network, tanh is the activation function, and q_i is the request vector corresponding to the i-th element of the input sequence. U_P^T and W_P are both trainable linear transformation matrices; U_P^T is the transpose of U_P, and since U_P is a d-dimensional column vector, U_P^T is a d-dimensional row vector, which maps the high-dimensional vector output by the feedforward neural network to a scalar. Here and below, a feedforward neural network is used to map vectors to hidden states, but the mapping method is not limited to this network; it can be replaced by other neural network models, such as the Long Short-Term Memory (LSTM) model and its variants, gated units and their variants, or a simple linear transformation.

2) The computer device converts the first scalar p_i into a scalar with range (0, 1) through a non-linear transformation function, and then multiplies it by the input sequence length I to obtain a center point position P_i with range (0, I); P_i is the center point of the local enhancement range corresponding to the i-th element, and P_i is proportional to the input sequence length I:

P_i = I·sigmoid(p_i);

where sigmoid is the non-linear transformation function used to convert p_i into a scalar with range (0, 1). Here and below, the use of sigmoid to convert scalars can also be replaced by any other method that maps an arbitrary real number into (0, 1); this application imposes no limitation.

The computer device takes the computed P_i as the center point of the local enhancement range corresponding to the i-th element x_i of the input sequence. For example, if the input sequence length I is 10 and the computed P_i equals 5, the center of the local enhancement range corresponding to x_i is the 5th element of the input sequence, and when the network representation corresponding to x_i is generated, the value vector of the 5th element of the input sequence is allocated the highest attention weight.

The computer device can repeat the above steps until the center point of the local enhancement range corresponding to every element is obtained from each request vector of the request vector sequence.
S504: Determine the window size of the local enhancement range corresponding to each element according to the request vector sequence.

To predict window sizes flexibly, a corresponding window size can be predicted for each element. The computer device can then determine the window size of the local enhancement range corresponding to each element from the individual request vectors of the request vector sequence, i.e., one window size per request vector.

In one embodiment, determining the window size of the local enhancement range corresponding to each element according to the request vector sequence may include: for each element of the input sequence, linearly transforming the request vector corresponding to that element in the request vector sequence through a second feedforward neural network to obtain a third scalar corresponding to the element; performing a non-linear transformation on the third scalar through a non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and taking the fourth scalar as the window size of the local enhancement range corresponding to the element.

The computer device can determine the window sizes of the local enhancement ranges corresponding to the elements from the request vector sequence obtained in step S204. Taking the i-th element x_i of the input sequence as an example, the window size of its local enhancement range can be obtained through the following steps:

1) The computer device maps the request vector q_i corresponding to the i-th element to a hidden state through the second feedforward neural network and linearly transforms the hidden state with U_D^T to obtain the third scalar z_i corresponding to the i-th element; the third scalar z_i is a value in real-number space, computed as

z_i = U_D^T·tanh(W_P·q_i);

where tanh(W_P·q_i) is part of the second feedforward neural network, tanh is the activation function, q_i is the request vector corresponding to the i-th element of the input sequence, and W_P is the same parameter matrix used above to compute the hidden state for the center point. U_D^T is a trainable linear transformation matrix; U_D^T is the transpose of U_D, and since U_D is a d-dimensional column vector, U_D^T is a d-dimensional row vector, which maps the high-dimensional vector output by the feedforward neural network to a scalar.

2) The computer device converts the third scalar z_i into a scalar with range (0, 1) through a non-linear transformation function, and then multiplies it by the input sequence length I to obtain a window size D_i with range (0, I); D_i is the window size of the local enhancement range corresponding to the i-th element, and D_i is proportional to the input sequence length I:

D_i = I·sigmoid(z_i);

where sigmoid is the non-linear transformation function used to convert z_i into a scalar with range (0, 1).

The computer device takes the computed D_i as the window size of the local enhancement range corresponding to the i-th element x_i of the input sequence. For example, if the input sequence length I is 10 and the computed D_i equals 7, the local enhancement range corresponding to x_i covers the 7 elements centered on the center point, and when the network representation corresponding to x_i is generated, the attention weights are concentrated among these 7 elements.

The computer device can repeat the above steps until the window size of the local enhancement range corresponding to every element is obtained from each request vector of the request vector sequence.
S506: Determine the local enhancement range corresponding to each element according to the center point and the window size.

As can be seen from steps S502 and S504, since the request vector corresponding to each element of the input sequence is different, the center point and window size corresponding to each element differ, and therefore so does each element's local enhancement range; the local enhancement range is selected according to the characteristics of each element itself, which is more flexible.

S508: Compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

The computer device can compute the strength of the association between every two elements based on the determined local enhancement range to obtain the local enhancement matrix, where the association strength between two elements is computed by

G_ij = -2(j - P_i)^2 / D_i^2;

G_ij being the value of the j-th element of the i-th column vector of the local enhancement matrix G.

FIG. 6 is a schematic flowchart of determining the local enhancement range according to the request vector sequence in one embodiment. Referring to FIG. 6, the request vector sequence is first mapped to hidden states through a feedforward neural network; a linear transformation then maps the hidden states to scalars in real-number space; the non-linear transformation function sigmoid converts these scalars into scalars with range (0, 1), which are then multiplied by the input sequence length I to obtain the center points and window sizes, determining the local enhancement range; and the local enhancement matrix is computed based on the local enhancement range.

In the above embodiment, by transforming the request vector corresponding to each element of the input sequence, a corresponding local enhancement range can be determined flexibly for each element rather than fixing one local enhancement range for the input sequence, which can effectively strengthen the dependency relationships between long-distance elements of the input sequence.
In one embodiment, constructing the local enhancement matrix according to the request vector sequence may include: determining the center point of the local enhancement range corresponding to each element according to the request vector sequence; determining the window size of a unified local enhancement range according to the key vector sequence; determining the local enhancement range corresponding to each element according to the center point and the window size; and computing the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

In this embodiment, the center points of the local enhancement ranges corresponding to the elements are determined from the request vector sequence in the same way as above, which is not repeated here. For the window size, global context information is considered: the window sizes of the local enhancement ranges corresponding to all elements of the input sequence are decided by one unified window size, so determining the window size requires fusing the information of all elements of the input sequence.

In one embodiment, determining the window size of the unified local enhancement range according to the key vector sequence may include: obtaining each key vector in the key vector sequence; computing the average of the key vectors; linearly transforming the average to obtain a fifth scalar; performing a non-linear transformation on the fifth scalar through a non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and taking the sixth scalar as the window size of the unified local enhancement range.

The computer device can determine the window size of the unified local enhancement range from the key vector sequence obtained in step S204, i.e., the window size of the local enhancement range corresponding to every element is the same. This unified window size can be obtained through the following steps:

1) The computer device obtains the key vector sequence K corresponding to the input sequence and computes the average of all key vectors in K:

k̄ = (1/I)·Σ_{i=1}^{I} k_i.

2) The computer device linearly transforms the obtained average k̄ to generate a fifth scalar z in real-number space:

z = U_D^T·tanh(W_D·k̄);

where U_D^T is the same parameter matrix used above to compute the hidden state for the window size, and W_D is a trainable linear transformation matrix.

3) The computer device converts the fifth scalar z into a scalar with range (0, 1) through a non-linear transformation function, and then multiplies it by the input sequence length I to obtain a window size D with range (0, I); D is the window size of the unified local enhancement range, and D is proportional to the input sequence length I:

D = I·sigmoid(z);

where sigmoid is the non-linear transformation function used to convert z into a scalar with range (0, 1).
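A sketch of this unified-window variant, reusing the numpy import and sigmoid helper and the naming assumptions of the earlier snippets; composing the shared U_D with the trainable W_D through a tanh hidden state is an assumption consistent with the formulas above:

```python
def unified_window(K, W_d, U_d):
    """One window size D shared by all I elements, from the mean key vector."""
    I = K.shape[0]
    k_bar = K.mean(axis=0)            # average of all key vectors, (d,)
    z = np.tanh(k_bar @ W_d) @ U_d    # fifth scalar, in real-number space
    D = I * sigmoid(z)                # sixth scalar, in (0, I)
    return np.full(I, D)              # same window for every element
```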
Although the window size of the local enhancement range corresponding to every element is the same, the center point corresponding to each element is computed from its own request vector, so the local enhancement ranges corresponding to the elements differ. The computer device can compute the strength of the association between every two elements based on the determined local enhancement range to obtain the local enhancement matrix, where the association strength between two elements is computed by

G_ij = -2(j - P_i)^2 / D^2;

G_ij being the value of the j-th element of the i-th column vector of the local enhancement matrix G.

FIG. 7 is a schematic flowchart of determining the local enhancement range according to the request vector sequence and the key vector sequence in one embodiment. Referring to FIG. 7, the request vector sequence is mapped to hidden states through a feedforward neural network, while the key vector sequence is averaged by mean pooling; linear transformations then map the hidden states and the average to scalars in real-number space; the non-linear transformation function sigmoid converts the obtained scalars into scalars with range (0, 1), which are then multiplied by the input sequence length I to obtain the center points and the window size, determining the local enhancement range.

In the above embodiment, the key vector sequence corresponding to the input sequence, which includes the feature vectors (key vectors) of all elements of the input sequence, is transformed; the determined unified window size therefore takes all of the context information into account, so that the local enhancement range determined for each element based on this unified window size can capture rich context information.
In one embodiment, linearly transforming the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it may include: dividing the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences; and, according to multiple groups of different parameter matrices, applying a different linear transformation to each group of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each group. The method further includes: splicing the network representation subsequences corresponding to the groups of source-side vector representation subsequences and then linearly transforming the result to obtain the output network representation sequence.

A stacked multi-head neural network can be used to process the source-side vector representation sequence corresponding to the input sequence; in that case the source-side vector representation sequence can be divided into multiple groups (also called heads) of low-dimensional subsequences. For example, if the source-side vector representation sequence includes 5 elements, each a 512-dimensional column vector, it can be divided into 8 parts, i.e., 8 source-side vector representation subsequences of size 5×64. These 8 subsequences are taken as input vectors and transformed through different subspaces, outputting 8 network representation subsequences of size 5×64; the 8 subsequences are spliced and linearly transformed to output a 5×512-dimensional network representation sequence.

For example: a stacked multi-head neural network includes H groups of subspaces. First, the input sequence X = {x_1, x_2, x_3, ..., x_I} is converted into the source-side vector representation sequence Z = {z_1, z_2, z_3, ..., z_I}, which is divided into H source-side vector representation subsequences. The subsequences are then transformed in the respective subspaces. Taking the transformation in the h-th subspace (h = 1, 2, ..., H) as an example: in the h-th subspace, Z_h = {z_h1, z_h2, z_h3, ..., z_hI} is linearly transformed by the corresponding learnable parameter matrices W_h^Q, W_h^K, and W_h^V to obtain the corresponding request vector sequence Q_h, key vector sequence K_h, and value vector sequence V_h. Across the H subspaces, the three learnable parameter matrices used by each subspace all differ, so each subspace obtains different feature vectors, and different subspaces can thereby attend to different local information.

Next, in the h-th subspace, the logical similarity between the request vector sequence and the key vector sequence is computed:

E_h = (Q_h·K_h^T)/√(d_h), where d_h is the dimension of the h-th subspace.

Then the local enhancement matrix G_h corresponding to the h-th subspace is constructed according to the request vector sequence Q_h or the key vector sequence K_h; each element G_hi,hj of G_h is computed as

G_hi,hj = -2(j - P_hi)^2 / D_hi^2;

in this formula, the center point P_hi of the local enhancement range corresponding to the i-th element is determined from Q_h, and the window size D_hi of the local enhancement range corresponding to the i-th element is determined from Q_h or K_h; G_hi,hj is the value of the j-th element of the i-th column vector of G_h and expresses the strength of the association, in the h-th subspace, between the j-th element of the input sequence and the center point P_hi corresponding to the i-th element.

Then, in the h-th subspace, the softmax non-linear transformation is applied to convert the logical similarity into an attention weight distribution, with the logical similarity modified by the local enhancement matrix G_h, giving the attention weight distribution A_h = softmax(E_h + G_h); continuing in the h-th subspace, the output representation sequence O_h corresponding to the input sequence is computed as O_h = A_h·V_h. Finally, the output representation sequences O_h of all subspaces are spliced and linearly transformed once more to obtain the final output vector O = Concat(O_1, O_2, O_3, ..., O_h, ..., O_H)·W^O.
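A sketch of the multi-head computation just described, assuming the local_attention function from the earlier snippet and one parameter dictionary per subspace whose keys match its arguments; the slicing scheme and the output matrix W_o are illustrative assumptions:

```python
def multi_head_local_attention(Z, head_params, W_o):
    """Split Z (I, d) into H low-dimensional subsequences, run locally
    enhanced attention in each subspace, splice, then linearly transform."""
    H = len(head_params)
    d_h = Z.shape[1] // H
    outs = [local_attention(Z[:, h * d_h:(h + 1) * d_h], **p)
            for h, p in enumerate(head_params)]
    return np.concatenate(outs, axis=-1) @ W_o   # O = Concat(O_1..O_H) W^O
```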
In one embodiment, the method further includes: after the network representation sequence corresponding to the input sequence is obtained, taking the network representation sequence as a new source-side vector representation sequence and returning to the step of linearly transforming the source-side vector representation sequence to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence, continuing execution until a loop stop condition is reached, at which point the final network representation sequence is output.

A neural network can stack multiple layers of computation; whether it is a single-layer neural network or a stacked multi-head neural network, the computation can be repeated over multiple layers. In the computation of each layer, the output of the previous layer is taken as the input of the next layer, and the step of applying the linear transformations to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence is repeated until the output of the current layer, i.e., the current layer's network representation sequence, is obtained. Considering efficiency and performance, the number of repetitions may be 6, and the network parameters of each layer differ. It can be understood that the process of repeating 6 times is, in effect, a process of updating the source-side vector representation sequence of the original input sequence 6 times through the network parameters of each layer. A loop sketch is given after the FIG. 8 description below.

For example, in a stacked multi-head neural network, the output of the first layer is O_L1; in the computation of the second layer, O_L1 is taken as input and transformed by the network parameters of the second layer to output the second layer's output O_L2, and so on until the number of repetitions is reached; the output after 6 repetitions is taken as the final output, i.e., O_L6 is the network representation sequence corresponding to the input sequence.

FIG. 8 is a schematic structural diagram of a multi-layer stacked multi-head self-attention neural network in one embodiment. Referring to FIG. 8, for each layer the input is the same, namely the output of the previous layer; the input is then divided into multiple sub-inputs, which are transformed in the same way by the respective network parameters of the multiple subspaces (also called multiple heads) to obtain the output of each subspace; these multiple outputs are spliced to obtain the output of the current layer, which is then the input of the next layer. This is repeated multiple times, and the output of the last layer is taken as the final output.
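Under the same assumptions as the earlier snippets, the multi-layer repetition reduces to a loop in which each layer's output becomes the next layer's input and each layer carries its own parameters:

```python
def stacked_network(Z, layers):
    """layers: one (head_params, W_o) pair per layer, e.g. 6 of them."""
    for head_params, W_o in layers:
        Z = multi_head_local_attention(Z, head_params, W_o)  # O_L1, O_L2, ...
    return Z   # final network representation sequence, e.g. O_L6
```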
In one embodiment, the input sequence may be a text sequence to be translated, and the output network representation sequence consists of the feature vectors corresponding to the words of the translated text, so the translated sentence can be determined from the output network representation sequence. With the various embodiments of this application, translation quality is significantly improved on longer phrases and longer sentences.
As shown in FIG. 9, a schematic flowchart of a method for generating a network representation of a neural network in one embodiment includes the following steps:

S902: Obtain a source-side vector representation sequence corresponding to the input sequence.

S904: Divide the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences.

S906: According to multiple groups of different parameter matrices, apply a different linear transformation to each group of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each group.

S908: Calculate the logical similarity between the request vector sequence and the key vector sequence.

S910: For each element in the input sequence, transform the request vector corresponding to that element in the request vector sequence through the first feedforward neural network to obtain the first scalar corresponding to the element.

S912: Perform a non-linear transformation on the first scalar through the non-linear transformation function to obtain a second scalar proportional to the length of the input sequence.

S914: Take the second scalar as the center point of the local enhancement range corresponding to the element.

S9162: For each element in the input sequence, linearly transform the request vector corresponding to that element in the request vector sequence through the second feedforward neural network to obtain the third scalar corresponding to the element.

S9164: Perform a non-linear transformation on the third scalar through the non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence.

S9166: Take the fourth scalar as the window size of the local enhancement range corresponding to the element.

S9161: Obtain each key vector in the key vector sequence and compute the average of the key vectors.

S9163: Linearly transform the average to obtain a fifth scalar.

S9165: Perform a non-linear transformation on the fifth scalar through the non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence.

S9167: Take the sixth scalar as the window size of the unified local enhancement range.

S918: Determine the local enhancement range corresponding to each element according to the center point and the window size.

S920: Compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

S922: Modify the logical similarity according to the local enhancement matrix to obtain the locally enhanced logical similarity.

S924: Normalize the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.

S926: Fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain the network representation sequence corresponding to the input sequence.

S928: Splice the groups of network representation subsequences corresponding to the source-side vector representation subsequences and then perform a linear transformation to obtain the output network representation sequence.

S930: Take the output network representation sequence as a new source-side vector representation sequence and return to step S904 until the final network representation sequence is obtained.
The above method for generating a network representation of a neural network constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within the local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, the request, key, and value vector sequences are obtained; the logical similarity can be computed from the request and key vector sequences, and a non-linear transformation based on the logical similarity and the local enhancement matrix yields the locally enhanced attention weight distribution, modifying the original attention weights. A weighted sum of the value vector sequence according to the locally enhanced attention weight distribution then yields a network representation sequence that not only strengthens local information but also preserves the connections between long-distance elements of the input sequence.

It should be understood that although the steps in the flowchart of FIG. 9 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 9 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 10, an apparatus 1000 for generating a network representation of a neural network is provided, including an acquisition module 1002, a linear transformation module 1004, a logical similarity calculation module 1006, a local enhancement matrix construction module 1008, an attention weight distribution determination module 1010, and a fusion module 1012, where:

the acquisition module 1002 is configured to obtain a source-side vector representation sequence corresponding to the input sequence;

the linear transformation module 1004 is configured to linearly transform the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it;

the logical similarity calculation module 1006 is configured to calculate the logical similarity between the request vector sequence and the key vector sequence;

the local enhancement matrix construction module 1008 is configured to construct a local enhancement matrix according to the request vector sequence;

the attention weight distribution determination module 1010 is configured to perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element; and

the fusion module 1012 is configured to fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain the network representation sequence corresponding to the input sequence.

In one embodiment, the local enhancement matrix construction module 1008 is further configured to: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of the local enhancement range corresponding to each element according to the request vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

In one embodiment, the local enhancement matrix construction module 1008 is further configured to: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of a unified local enhancement range according to the key vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

In one embodiment, the local enhancement matrix construction module 1008 is further configured to: for each element in the input sequence, transform the request vector corresponding to that element in the request vector sequence through the first feedforward neural network to obtain the first scalar corresponding to the element; perform a non-linear transformation on the first scalar through the non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and take the second scalar as the center point of the local enhancement range corresponding to the element.

In one embodiment, the local enhancement matrix construction module 1008 is further configured to: for each element in the input sequence, linearly transform the request vector corresponding to that element in the request vector sequence through the second feedforward neural network to obtain the third scalar corresponding to the element; perform a non-linear transformation on the third scalar through the non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and take the fourth scalar as the window size of the local enhancement range corresponding to the element.

In one embodiment, the local enhancement matrix construction module 1008 is further configured to: obtain each key vector in the key vector sequence; compute the average of the key vectors; linearly transform the average to obtain a fifth scalar; perform a non-linear transformation on the fifth scalar through the non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and take the sixth scalar as the window size of the unified local enhancement range.

In one embodiment, the local enhancement matrix construction module 1008 is further configured to: take the center point as the expectation of a Gaussian distribution and the window size as its variance; determine the local enhancement range according to the Gaussian distribution defined by that mean and variance; and arrange the association strengths between every two elements in the order of the elements in the input sequence to obtain the local enhancement matrix, where the association strength between two elements is computed by the following formula:

G_ij = -2(j - P_i)^2 / D_i^2;

where G_ij denotes the strength of the association between the j-th element of the input sequence and the center point P_i corresponding to the i-th element, and is the value of the j-th element of the i-th column vector of the local enhancement matrix G; P_i denotes the center point of the local enhancement range corresponding to the i-th element; and D_i denotes the window size of the local enhancement range corresponding to the i-th element.

In one embodiment, the attention weight distribution determination module 1010 is further configured to: modify the logical similarity according to the local enhancement matrix to obtain the locally enhanced logical similarity; and normalize the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.

In one embodiment, the linear transformation module 1004 is further configured to: divide the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences; and, according to multiple groups of different parameter matrices, apply a different linear transformation to each group of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each group. The apparatus further includes a splicing module, configured to splice the network representation subsequences corresponding to the groups of source-side vector representation subsequences and then linearly transform the result to obtain the output network representation sequence.

In one embodiment, the apparatus 1000 further includes a loop module, configured to: after the network representation sequence corresponding to the input sequence is obtained, take the network representation sequence as a new source-side vector representation sequence and return to the step of linearly transforming the source-side vector representation sequence to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence, continuing execution until a loop stop condition is reached, at which point the final network representation sequence is output.

The above apparatus 1000 for generating a network representation of a neural network constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within the local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, the request, key, and value vector sequences are obtained; the logical similarity can be computed from the request and key vector sequences, and a non-linear transformation based on the logical similarity and the local enhancement matrix yields the locally enhanced attention weight distribution, modifying the original attention weights. A weighted sum of the value vector sequence according to the locally enhanced attention weight distribution then yields a network representation sequence that not only strengthens local information but also preserves the connections between long-distance elements of the input sequence.
FIG. 11 shows an internal structure diagram of the computer device 120 in one embodiment. As shown in FIG. 11, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program; when that computer program is executed by the processor, the processor is caused to implement the method for generating a network representation of a neural network. A computer program may also be stored in the internal memory; when it is executed by the processor, the processor is caused to perform the method for generating a network representation of a neural network.

Those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; an actual computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, the apparatus 1000 for generating a network representation of a neural network provided in this application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 11. The memory of the computer device may store the program modules constituting the apparatus 1000, for example the acquisition module 1002, the linear transformation module 1004, the logical similarity calculation module 1006, the local enhancement matrix construction module 1008, the attention weight distribution determination module 1010, and the fusion module 1012 shown in FIG. 10. The computer program constituted by these program modules causes the processor to perform the steps of the method for generating a network representation of a neural network of the embodiments of this application described in this specification.

For example, the computer device shown in FIG. 11 may perform step S202 through the acquisition module 1002 of the apparatus shown in FIG. 10, step S204 through the linear transformation module 1004, step S206 through the logical similarity calculation module 1006, step S208 through the local enhancement matrix construction module 1008, step S210 through the attention weight distribution determination module 1010, and step S212 through the fusion module 1012.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the following steps: obtain a source-side vector representation sequence corresponding to the input sequence; linearly transform the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it; calculate the logical similarity between the request vector sequence and the key vector sequence; construct a local enhancement matrix according to the request vector sequence; perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element; and fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain the network representation sequence corresponding to the input sequence.

In one embodiment, when the computer program is executed by the processor to perform the step of constructing the local enhancement matrix according to the request vector sequence, the processor is caused to perform the following steps: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of the local enhancement range corresponding to each element according to the request vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

In one embodiment, when the computer program is executed by the processor to perform the step of constructing the local enhancement matrix according to the request vector sequence, the processor is caused to perform the following steps: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of a unified local enhancement range according to the key vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the center point of the local enhancement range corresponding to each element according to the request vector sequence, the processor is caused to perform the following steps: for each element in the input sequence, transform the request vector corresponding to that element in the request vector sequence through the first feedforward neural network to obtain the first scalar corresponding to the element; perform a non-linear transformation on the first scalar through the non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and take the second scalar as the center point of the local enhancement range corresponding to the element.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the window size of the local enhancement range corresponding to each element according to the request vector sequence, the processor is caused to perform the following steps: for each element in the input sequence, linearly transform the request vector corresponding to that element in the request vector sequence through the second feedforward neural network to obtain the third scalar corresponding to the element; perform a non-linear transformation on the third scalar through the non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and take the fourth scalar as the window size of the local enhancement range corresponding to the element.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the window size of the unified local enhancement range according to the key vector sequence, the processor is caused to perform the following steps: obtain each key vector in the key vector sequence; compute the average of the key vectors; linearly transform the average to obtain a fifth scalar; perform a non-linear transformation on the fifth scalar through the non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and take the sixth scalar as the window size of the unified local enhancement range.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the local enhancement range corresponding to each element according to the center point and the window size, the processor is caused to perform the following steps: take the center point as the expectation of a Gaussian distribution and the window size as its variance, and determine the local enhancement range according to the Gaussian distribution defined by that mean and variance. When the computer program is executed by the processor to perform the step of computing the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix, the processor is caused to perform the following steps: arrange the association strengths between every two elements in the order of the elements in the input sequence to obtain the local enhancement matrix, where the association strength between two elements is computed by the following formula:

G_ij = -2(j - P_i)^2 / D_i^2;

where G_ij denotes the strength of the association between the j-th element of the input sequence and the center point P_i corresponding to the i-th element, and is the value of the j-th element of the i-th column vector of the local enhancement matrix G; P_i denotes the center point of the local enhancement range corresponding to the i-th element; and D_i denotes the window size of the local enhancement range corresponding to the i-th element.

In one embodiment, when the computer program is executed by the processor to perform the step of performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element, the processor is caused to perform the following steps: modify the logical similarity according to the local enhancement matrix to obtain the locally enhanced logical similarity; and normalize the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.

In one embodiment, when the computer program is executed by the processor to perform the step of linearly transforming the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it, the processor is caused to perform the following steps: divide the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences; and, according to multiple groups of different parameter matrices, apply a different linear transformation to each group of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each group. When the computer program is executed by the processor, the processor is further caused to perform the following step: splice the network representation subsequences corresponding to the groups of source-side vector representation subsequences and then linearly transform the result to obtain the output network representation sequence.

In one embodiment, when the computer program is executed by the processor, the processor is further caused to perform the following steps: after the network representation sequence corresponding to the input sequence is obtained, take the network representation sequence as a new source-side vector representation sequence and return to the step of linearly transforming the source-side vector representation sequence to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence, continuing execution until a loop stop condition is reached, at which point the final network representation sequence is output.

The above computer device constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within the local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, the request, key, and value vector sequences are obtained; the logical similarity can be computed from the request and key vector sequences, and a non-linear transformation based on the logical similarity and the local enhancement matrix yields the locally enhanced attention weight distribution, modifying the original attention weights. A weighted sum of the value vector sequence according to the locally enhanced attention weight distribution then yields a network representation sequence that not only strengthens local information but also preserves the connections between long-distance elements of the input sequence.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the processor is caused to perform the following steps: obtain a source-side vector representation sequence corresponding to the input sequence; linearly transform the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it; calculate the logical similarity between the request vector sequence and the key vector sequence; construct a local enhancement matrix according to the request vector sequence; perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element; and fuse the value vectors in the value vector sequence according to the attention weight distribution to obtain the network representation sequence corresponding to the input sequence.

In one embodiment, when the computer program is executed by the processor to perform the step of constructing the local enhancement matrix according to the request vector sequence, the processor is caused to perform the following steps: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of the local enhancement range corresponding to each element according to the request vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

In one embodiment, when the computer program is executed by the processor to perform the step of constructing the local enhancement matrix according to the request vector sequence, the processor is caused to perform the following steps: determine the center point of the local enhancement range corresponding to each element according to the request vector sequence; determine the window size of a unified local enhancement range according to the key vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the center point of the local enhancement range corresponding to each element according to the request vector sequence, the processor is caused to perform the following steps: for each element in the input sequence, transform the request vector corresponding to that element in the request vector sequence through the first feedforward neural network to obtain the first scalar corresponding to the element; perform a non-linear transformation on the first scalar through the non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and take the second scalar as the center point of the local enhancement range corresponding to the element.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the window size of the local enhancement range corresponding to each element according to the request vector sequence, the processor is caused to perform the following steps: for each element in the input sequence, linearly transform the request vector corresponding to that element in the request vector sequence through the second feedforward neural network to obtain the third scalar corresponding to the element; perform a non-linear transformation on the third scalar through the non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and take the fourth scalar as the window size of the local enhancement range corresponding to the element.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the window size of the unified local enhancement range according to the key vector sequence, the processor is caused to perform the following steps: obtain each key vector in the key vector sequence; compute the average of the key vectors; linearly transform the average to obtain a fifth scalar; perform a non-linear transformation on the fifth scalar through the non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and take the sixth scalar as the window size of the unified local enhancement range.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the local enhancement range corresponding to each element according to the center point and the window size, the processor is caused to perform the following steps: take the center point as the expectation of a Gaussian distribution and the window size as its variance, and determine the local enhancement range according to the Gaussian distribution defined by that mean and variance. When the computer program is executed by the processor to perform the step of computing the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix, the processor is caused to perform the following steps: arrange the association strengths between every two elements in the order of the elements in the input sequence to obtain the local enhancement matrix, where the association strength between two elements is computed by the following formula:

G_ij = -2(j - P_i)^2 / D_i^2;

where G_ij denotes the strength of the association between the j-th element of the input sequence and the center point P_i corresponding to the i-th element, and is the value of the j-th element of the i-th column vector of the local enhancement matrix G; P_i denotes the center point of the local enhancement range corresponding to the i-th element; and D_i denotes the window size of the local enhancement range corresponding to the i-th element.

In one embodiment, when the computer program is executed by the processor to perform the step of performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain the locally enhanced attention weight distribution corresponding to each element, the processor is caused to perform the following steps: modify the logical similarity according to the local enhancement matrix to obtain the locally enhanced logical similarity; and normalize the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.

In one embodiment, when the computer program is executed by the processor to perform the step of linearly transforming the source-side vector representation sequence to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to it, the processor is caused to perform the following steps: divide the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences; and, according to multiple groups of different parameter matrices, apply a different linear transformation to each group of source-side vector representation subsequences to obtain the request vector sequence, key vector sequence, and value vector sequence corresponding to each group. When the computer program is executed by the processor, the processor is further caused to perform the following step: splice the network representation subsequences corresponding to the groups of source-side vector representation subsequences and then linearly transform the result to obtain the output network representation sequence.

In one embodiment, when the computer program is executed by the processor, the processor is further caused to perform the following steps: after the network representation sequence corresponding to the input sequence is obtained, take the network representation sequence as a new source-side vector representation sequence and return to the step of linearly transforming the source-side vector representation sequence to obtain the corresponding request vector sequence, key vector sequence, and value vector sequence, continuing execution until a loop stop condition is reached, at which point the final network representation sequence is output.

The above computer-readable storage medium constructs a local enhancement matrix based on the request vector sequence corresponding to the input sequence, so attention weights can be allocated within the local enhancement range and local information is strengthened. After the source-side vector representation sequence corresponding to the input sequence is linearly transformed, the request, key, and value vector sequences are obtained; the logical similarity can be computed from the request and key vector sequences, and a non-linear transformation based on the logical similarity and the local enhancement matrix yields the locally enhanced attention weight distribution, modifying the original attention weights. A weighted sum of the value vector sequence according to the locally enhanced attention weight distribution then yields a network representation sequence that not only strengthens local information but also preserves the connections between long-distance elements of the input sequence.
Those of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments can be completed by a computer program instructing the relevant hardware; the program may be stored in a non-volatile computer-readable storage medium, and when executed, it may include the processes of the embodiments of the above methods. Any reference to a memory, storage, database, or other medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be understood as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (22)

  1. A method for generating a network representation of a neural network, for use in a computer device, the method comprising:
    obtaining a source-side vector representation sequence corresponding to an input sequence;
    linearly transforming the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence;
    calculating a logical similarity between the request vector sequence and the key vector sequence;
    constructing a local enhancement matrix according to the request vector sequence;
    performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element; and
    fusing value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence.
  2. The method according to claim 1, wherein the constructing a local enhancement matrix according to the request vector sequence comprises:
    determining a center point of a local enhancement range corresponding to each element according to the request vector sequence;
    determining a window size of the local enhancement range corresponding to each element according to the request vector sequence;
    determining the local enhancement range corresponding to each element according to the center point and the window size; and
    computing the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.
  3. The method according to claim 1, wherein the constructing a local enhancement matrix according to the request vector sequence comprises:
    determining a center point of a local enhancement range corresponding to each element according to the request vector sequence;
    determining a window size of a unified local enhancement range according to the key vector sequence;
    determining the local enhancement range corresponding to each element according to the center point and the window size; and
    computing the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.
  4. The method according to claim 2 or 3, wherein the determining a center point of the local enhancement range corresponding to each element according to the request vector sequence comprises:
    for each element in the input sequence, transforming the request vector corresponding to the element in the request vector sequence through a first feedforward neural network to obtain a first scalar corresponding to the element;
    performing a non-linear transformation on the first scalar through a non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and
    taking the second scalar as the center point of the local enhancement range corresponding to the element.
  5. The method according to claim 2, wherein the determining a window size of the local enhancement range corresponding to each element according to the request vector sequence comprises:
    for each element in the input sequence, linearly transforming the request vector corresponding to the element in the request vector sequence through a second feedforward neural network to obtain a third scalar corresponding to the element;
    performing a non-linear transformation on the third scalar through a non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and
    taking the fourth scalar as the window size of the local enhancement range corresponding to the element.
  6. The method according to claim 3, wherein the determining a window size of a unified local enhancement range according to the key vector sequence comprises:
    obtaining each key vector in the key vector sequence;
    computing an average of the key vectors;
    linearly transforming the average to obtain a fifth scalar;
    performing a non-linear transformation on the fifth scalar through a non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and
    taking the sixth scalar as the window size of the unified local enhancement range.
  7. The method according to claim 2 or 3, wherein the determining the local enhancement range corresponding to each element according to the center point and the window size comprises:
    taking the center point as the expectation of a Gaussian distribution and the window size as the variance of the Gaussian distribution; and
    determining the local enhancement range according to the Gaussian distribution determined by the mean and the variance;
    and wherein the computing the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix comprises:
    arranging the association strengths between every two elements in the order of the elements in the input sequence to obtain the local enhancement matrix, the association strength between two elements being computed by the following formula:
    G_ij = -2(j - P_i)^2 / D_i^2;
    where G_ij denotes the strength of the association between the j-th element of the input sequence and the center point P_i corresponding to the i-th element, and is the value of the j-th element of the i-th column vector of the local enhancement matrix G; P_i denotes the center point of the local enhancement range corresponding to the i-th element; and D_i denotes the window size of the local enhancement range corresponding to the i-th element.
  8. The method according to any one of claims 1 to 3, wherein the performing a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element comprises:
    modifying the logical similarity according to the local enhancement matrix to obtain a locally enhanced logical similarity; and
    normalizing the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.
  9. The method according to any one of claims 1 to 3, wherein the linearly transforming the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence comprises:
    dividing the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences; and
    applying, according to multiple groups of different parameter matrices, a different linear transformation to each group of source-side vector representation subsequences to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to each group of source-side vector representation subsequences;
    the method further comprising:
    splicing the network representation subsequences corresponding to each group of source-side vector representation subsequences and then performing a linear transformation to obtain the output network representation sequence.
  10. The method according to any one of claims 1 to 3, further comprising:
    after the network representation sequence corresponding to the input sequence is obtained, taking the network representation sequence as a new source-side vector representation sequence, and returning to the step of linearly transforming the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence, to continue execution until a loop stop condition is reached, whereupon the final network representation sequence is output.
  11. An apparatus for generating a network representation of a neural network, the apparatus comprising:
    an acquisition module, configured to obtain a source-side vector representation sequence corresponding to an input sequence;
    a linear transformation module, configured to linearly transform the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence;
    a logical similarity calculation module, configured to calculate a logical similarity between the request vector sequence and the key vector sequence;
    a local enhancement matrix construction module, configured to construct a local enhancement matrix according to the request vector sequence;
    an attention weight distribution determination module, configured to perform a non-linear transformation based on the logical similarity and the local enhancement matrix to obtain a locally enhanced attention weight distribution corresponding to each element; and
    a fusion module, configured to fuse value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence.
  12. The apparatus according to claim 11, wherein the local enhancement matrix construction module is further configured to: determine a center point of a local enhancement range corresponding to each element according to the request vector sequence; determine a window size of the local enhancement range corresponding to each element according to the request vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.
  13. The apparatus according to claim 11, wherein the local enhancement matrix construction module is further configured to: determine a center point of a local enhancement range corresponding to each element according to the request vector sequence; determine a window size of a unified local enhancement range according to the key vector sequence; determine the local enhancement range corresponding to each element according to the center point and the window size; and compute the strength of the association between every two elements based on the local enhancement range to obtain the local enhancement matrix.
  14. The apparatus according to claim 12 or 13, wherein the local enhancement matrix construction module is further configured to: for each element in the input sequence, transform the request vector corresponding to the element in the request vector sequence through a first feedforward neural network to obtain a first scalar corresponding to the element; perform a non-linear transformation on the first scalar through a non-linear transformation function to obtain a second scalar proportional to the length of the input sequence; and take the second scalar as the center point of the local enhancement range corresponding to the element.
  15. The apparatus according to claim 12, wherein the local enhancement matrix construction module is further configured to: for each element in the input sequence, linearly transform the request vector corresponding to the element in the request vector sequence through a second feedforward neural network to obtain a third scalar corresponding to the element; perform a non-linear transformation on the third scalar through a non-linear transformation function to obtain a fourth scalar proportional to the length of the input sequence; and take the fourth scalar as the window size of the local enhancement range corresponding to the element.
  16. The apparatus according to claim 13, wherein the local enhancement matrix construction module is further configured to: obtain each key vector in the key vector sequence; compute an average of the key vectors; linearly transform the average to obtain a fifth scalar; perform a non-linear transformation on the fifth scalar through a non-linear transformation function to obtain a sixth scalar proportional to the length of the input sequence; and take the sixth scalar as the window size of the unified local enhancement range.
  17. The apparatus according to claim 12 or 13, wherein the local enhancement matrix construction module is further configured to: take the center point as the expectation of a Gaussian distribution and the window size as the variance of the Gaussian distribution; determine the local enhancement range according to the Gaussian distribution determined by the mean and the variance; and arrange the association strengths between every two elements in the order of the elements in the input sequence to obtain the local enhancement matrix, the association strength between two elements being computed by the following formula:
    G_ij = -2(j - P_i)^2 / D_i^2;
    where G_ij denotes the strength of the association between the j-th element of the input sequence and the center point P_i corresponding to the i-th element, and is the value of the j-th element of the i-th column vector of the local enhancement matrix G; P_i denotes the center point of the local enhancement range corresponding to the i-th element; and D_i denotes the window size of the local enhancement range corresponding to the i-th element.
  18. The apparatus according to any one of claims 11 to 13, wherein the attention weight distribution determination module is further configured to: modify the logical similarity according to the local enhancement matrix to obtain a locally enhanced logical similarity; and normalize the locally enhanced logical similarity to obtain the locally enhanced attention weight distribution corresponding to each element.
  19. The apparatus according to any one of claims 11 to 13, wherein the linear transformation module is further configured to: divide the source-side vector representation sequence into multiple groups of low-dimensional source-side vector representation subsequences; and apply, according to multiple groups of different parameter matrices, a different linear transformation to each group of source-side vector representation subsequences to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to each group of source-side vector representation subsequences;
    the apparatus further comprising:
    a splicing module, configured to splice the network representation subsequences corresponding to each group of source-side vector representation subsequences and then perform a linear transformation to obtain the output network representation sequence.
  20. The apparatus according to any one of claims 11 to 13, further comprising:
    a loop module, configured to: after the network representation sequence corresponding to the input sequence is obtained, take the network representation sequence as a new source-side vector representation sequence, and return to the step of linearly transforming the source-side vector representation sequence to obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source-side vector representation sequence, to continue execution until a loop stop condition is reached, whereupon the final network representation sequence is output.
  21. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 10.
  22. A computer device, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 10.
PCT/CN2019/100212 2018-09-04 2019-08-12 神经网络的网络表示生成方法、装置、存储介质和设备 WO2020048292A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020551812A JP7098190B2 (ja) 2018-09-04 2019-08-12 ニューラルネットワークのネットワーク表示生成方法及びその装置、コンピュータプログラム並びに機器
EP19857335.4A EP3848856A4 (en) 2018-09-04 2019-08-12 METHOD AND DEVICE FOR GENERATING A NETWORK REPRESENTATION OF A NEURAL NETWORK, STORAGE MEDIUM AND DEVICE
US17/069,609 US11875220B2 (en) 2018-09-04 2020-10-13 Method, apparatus, and storage medium for generating network representation for neural network

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811027795.X 2018-09-04
CN201811027795.XA CN109034378B (zh) 2018-09-04 2018-09-04 神经网络的网络表示生成方法、装置、存储介质和设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/069,609 Continuation US11875220B2 (en) 2018-09-04 2020-10-13 Method, apparatus, and storage medium for generating network representation for neural network

Publications (1)

Publication Number Publication Date
WO2020048292A1 true WO2020048292A1 (zh) 2020-03-12

Family

ID=64623896

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/100212 WO2020048292A1 (zh) 2018-09-04 2019-08-12 Method and apparatus for generating network representation of neural network, storage medium, and device

Country Status (5)

Country Link
US (1) US11875220B2 (zh)
EP (1) EP3848856A4 (zh)
JP (1) JP7098190B2 (zh)
CN (1) CN109034378B (zh)
WO (1) WO2020048292A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785848A (zh) * 2021-01-04 2021-05-11 清华大学 Traffic data prediction method and system
CN113378791A (zh) * 2021-07-09 2021-09-10 合肥工业大学 Cervical cell classification method based on dual attention mechanism and multi-scale feature fusion

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034378B (zh) * 2018-09-04 2023-03-31 腾讯科技(深圳)有限公司 Method and apparatus for generating network representation of neural network, storage medium, and device
CN109918630B (zh) * 2019-01-23 2023-08-04 平安科技(深圳)有限公司 Text generation method and apparatus, computer device, and storage medium
CN111368564B (zh) * 2019-04-17 2022-04-08 腾讯科技(深圳)有限公司 Text processing method and apparatus, computer-readable storage medium, and computer device
CN110276082B (zh) * 2019-06-06 2023-06-30 百度在线网络技术(北京)有限公司 Dynamic window-based translation processing method and apparatus
CN110347790B (zh) * 2019-06-18 2021-08-10 广州杰赛科技股份有限公司 Attention mechanism-based text duplicate checking method, apparatus, device, and storage medium
CN110705273B (zh) * 2019-09-02 2023-06-13 腾讯科技(深圳)有限公司 Neural network-based information processing method and apparatus, medium, and electronic device
US11875131B2 (en) * 2020-09-16 2024-01-16 International Business Machines Corporation Zero-shot cross-lingual transfer learning
CN112434527B (zh) * 2020-12-03 2024-06-18 上海明略人工智能(集团)有限公司 Keyword determination method and apparatus, electronic device, and storage medium
CN112967112B (zh) * 2021-03-24 2022-04-29 武汉大学 E-commerce recommendation method based on self-attention mechanism and graph neural network
CN113392139B (zh) * 2021-06-04 2023-10-20 中国科学院计算技术研究所 Association fusion-based environmental monitoring data completion method and system
CN113254592B (zh) * 2021-06-17 2021-10-22 成都晓多科技有限公司 Comment aspect detection method and system using a gate-mechanism multi-level attention model
CN113283235B (zh) * 2021-07-21 2021-11-19 明品云(北京)数据科技有限公司 User tag prediction method and system
CN113887325A (zh) * 2021-09-10 2022-01-04 北京三快在线科技有限公司 Model training method, expression recognition method, and apparatus
CN117180952B (zh) * 2023-11-07 2024-02-02 湖南正明环保股份有限公司 Multi-directional airflow material-layer circulation semi-dry flue gas desulfurization system and method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107783960A (zh) * 2017-10-23 2018-03-09 百度在线网络技术(北京)有限公司 Method, apparatus, and device for extracting information
CN107797992A (zh) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Named entity recognition method and apparatus
US20180240013A1 (en) * 2017-02-17 2018-08-23 Google Inc. Cooperatively training and/or using separate input and subsequent content neural networks for information retrieval
CN109034378A (zh) * 2018-09-04 2018-12-18 腾讯科技(深圳)有限公司 Method and apparatus for generating network representation of neural network, storage medium, and device

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09297112A (ja) * 1996-03-08 1997-11-18 Mitsubishi Heavy Ind Ltd Structural parameter analysis device and analysis method
US7496546B2 (en) * 2003-03-24 2009-02-24 Riken Interconnecting neural network system, interconnecting neural network structure construction method, self-organizing neural network structure construction method, and construction programs therefor
CN104765728B (zh) 2014-01-08 2017-07-18 富士通株式会社 Method and apparatus for training neural network and method for determining sparse feature vector
EP3141610A1 (en) * 2015-09-12 2017-03-15 Jennewein Biotechnologie GmbH Production of human milk oligosaccharides in microbial hosts with engineered import / export
CN106056526B (zh) * 2016-05-26 2019-04-12 南昌大学 Image encryption algorithm based on analytic sparse representation and compressed sensing
CN106096640B (zh) * 2016-05-31 2019-03-26 合肥工业大学 Feature dimensionality reduction method for multi-mode systems
CN106339564B (zh) * 2016-09-06 2017-11-24 西安石油大学 Perforation scheme optimization method based on grey relational clustering
CN106571135B (zh) * 2016-10-27 2020-06-09 苏州大学 Whispered speech feature extraction method and system
CN107025219B (zh) * 2017-04-19 2019-07-26 厦门大学 Word embedding representation method based on internal semantic hierarchy
CN107180247A (zh) * 2017-05-19 2017-09-19 中国人民解放军国防科学技术大学 Relation classifier based on selective-attention convolutional neural network and method thereof
CN107345860B (zh) * 2017-07-11 2019-05-31 南京康尼机电股份有限公司 Method for identifying sub-health states of rail vehicle doors based on time-series data mining
GB2566257A (en) * 2017-08-29 2019-03-13 Sky Cp Ltd System and method for content discovery
CN108256172B (zh) * 2017-12-26 2021-12-07 同济大学 Early-warning and forecasting method for dangerous situations during pipe jacking beneath an existing box culvert
CN108537822B (zh) * 2017-12-29 2020-04-21 西安电子科技大学 Moving target tracking method based on weighted confidence estimation
CN108334499B (zh) * 2018-02-08 2022-03-18 海南云江科技有限公司 Text label annotation device, method, and computing device
CN108828533B (zh) * 2018-04-26 2021-12-31 电子科技大学 Nonlinear projection feature extraction method preserving intra-class sample similarity structure

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240013A1 (en) * 2017-02-17 2018-08-23 Google Inc. Cooperatively training and/or using separate input and subsequent content neural networks for information retrieval
CN107783960A (zh) * 2017-10-23 2018-03-09 百度在线网络技术(北京)有限公司 Method, apparatus, and device for extracting information
CN107797992A (zh) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Named entity recognition method and apparatus
CN109034378A (zh) * 2018-09-04 2018-12-18 腾讯科技(深圳)有限公司 Method and apparatus for generating network representation of neural network, storage medium, and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3848856A4

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785848A (zh) * 2021-01-04 2021-05-11 清华大学 Traffic data prediction method and system
CN112785848B (zh) * 2021-01-04 2022-06-17 清华大学 Traffic data prediction method and system
CN113378791A (zh) * 2021-07-09 2021-09-10 合肥工业大学 Cervical cell classification method based on dual attention mechanism and multi-scale feature fusion
CN113378791B (zh) * 2021-07-09 2022-08-05 合肥工业大学 Cervical cell classification method based on dual attention mechanism and multi-scale feature fusion

Also Published As

Publication number Publication date
EP3848856A4 (en) 2021-11-17
US11875220B2 (en) 2024-01-16
CN109034378B (zh) 2023-03-31
JP2021517316A (ja) 2021-07-15
EP3848856A1 (en) 2021-07-14
US20210042603A1 (en) 2021-02-11
CN109034378A (zh) 2018-12-18
JP7098190B2 (ja) 2022-07-11

Similar Documents

Publication Publication Date Title
WO2020048292A1 (zh) Method and apparatus for generating network representation of neural network, storage medium, and device
CN109146064B (zh) Neural network training method and apparatus, computer device, and storage medium
CN109271646B (zh) Text translation method and apparatus, readable storage medium, and computer device
US11853709B2 (en) Text translation method and apparatus, storage medium, and computer device
CN107590192B (zh) Mathematical processing method, apparatus, device, and storage medium for text questions
CN111460807B (zh) Sequence labeling method and apparatus, computer device, and storage medium
CN111061847A (zh) Dialogue generation and corpus expansion method and apparatus, computer device, and storage medium
CN108665506B (zh) Image processing method and apparatus, computer storage medium, and server
BR112020022270A2 (pt) Systems and methods for unifying statistical models for different data modalities
BR112019014822B1 (pt) System, non-transitory computer storage medium, and method for attention-based sequence transduction neural networks
CN108776832B (zh) Information processing method and apparatus, computer device, and storage medium
WO2021196954A1 (zh) Serialized data processing method and apparatus, and text processing method and apparatus
CN109710953B (zh) Translation method and apparatus, computing device, storage medium, and chip
WO2021159201A1 (en) Initialization of parameters for machine-learned transformer neural network architectures
WO2020192307A1 (zh) Deep learning-based answer extraction method and apparatus, computer device, and storage medium
WO2021000412A1 (zh) Text matching degree detection method and apparatus, computer device, and readable storage medium
WO2020211611A1 (zh) Method and apparatus for generating hidden states in a recurrent neural network for language processing
WO2021139344A1 (zh) Artificial intelligence-based text generation method and apparatus, computer device, and medium
CN112699215B (zh) Rating prediction method and system based on capsule network and interactive attention mechanism
CN112560456A (zh) Generative summary generation method and system based on an improved neural network
CN111597339A (zh) Document-level multi-turn dialogue intent classification method, apparatus, device, and storage medium
CN112837673B (zh) Artificial intelligence-based speech synthesis method and apparatus, computer device, and medium
CN112364650A (zh) Joint entity-relation extraction method, terminal, and storage medium
CN116469359A (zh) Music style transfer method and apparatus, computer device, and storage medium
CN113420869B (zh) Omnidirectional attention-based translation method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19857335

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020551812

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019857335

Country of ref document: EP

Effective date: 20210406