CN111291576B - Method, device, equipment and medium for determining internal representation information quantity of neural network

Publication number: CN111291576B
Authority: CN (China)
Prior art keywords: text vector, internal representation, information, neural network, processing layer
Legal status: Active
Application number: CN202010151758.0A
Other languages: Chinese (zh)
Other versions: CN111291576A (en)
Inventors: 王龙跃 (Longyue Wang), 杨依林 (Yilin Yang), 史树明 (Shuming Shi), 涂兆鹏 (Zhaopeng Tu)
Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority: CN202010151758.0A
Published as CN111291576A; application granted and published as CN111291576B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks


Abstract

The disclosure provides a method, a device, equipment and a medium for determining the internal representation information quantity of a neural network. The method for determining the amount of information of an internal representation of the neural network comprises the following steps: processing an input text vector by using the neural network, and extracting an internal representation generated by a feature processing layer in the neural network; fitting a target text vector and the internal representation by using a probe decoder to obtain a probability value, wherein the probability value represents the probability of mapping from the internal representation to the target text vector; and determining an amount of information of the internal representation relative to the target text vector based on the probability value.

Description

Method, device, equipment and medium for determining internal representation information quantity of neural network
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a medium for determining an internal representation information amount of a neural network.
Background
The neural network model encodes and maps input information to an output through interactions between neurons. In a neural network model, the complexity of the network structure increases with task difficulty; in addition, identical network modules are often stacked multiple times. At present, the information learned by each network module in a neural network cannot be described quantitatively, that is, it cannot be clarified how each network module contributes to the output, which limits a deep understanding of the neural network's processing.
Disclosure of Invention
The present disclosure provides a neural network internal representation information quantity determination method, apparatus, device, and medium for determining the amount of information of an internal representation generated by a feature processing layer in a neural network.
According to an aspect of the present disclosure, there is provided a method for determining an amount of information represented inside a neural network, including: processing an input text vector by using the neural network, and extracting an internal representation generated by a feature processing layer in the neural network; fitting a target text vector and the internal representation by using a probe decoder to obtain a probability value, wherein the probability value represents the probability of mapping the internal representation into the target text vector; and determining an amount of information of the internal representation relative to the target text vector based on the probability value.
According to some embodiments of the present disclosure, the probe decoder includes a self-attention processing layer, an encoding-decoding attention processing layer, and a full-connection processing layer.
According to some embodiments of the present disclosure, the neural network is a machine translation neural network comprising an encoder network and a decoder network, the decoder network comprising at least one decoder comprising a self-attention processing layer, an encoding-decoding attention processing layer and a fully-connected processing layer, wherein the feature processing layer is a processing layer belonging to the decoder network.
According to some embodiments of the disclosure, the determining an amount of information of the internal representation relative to the target text vector based on the probability value comprises: calculating, based on the probability value, a negative log-likelihood similarity for characterizing the information amount.
According to some embodiments of the disclosure, the target text vector is a translated text vector of the input text vector, the method further comprising: changing a network structure of the machine translation neural network based on an amount of information of the internal representation relative to the target text vector.
According to some embodiments of the disclosure, the fully-connected processing layer comprises an addition-normalization layer and a feed-forward layer, wherein the changing the network structure of the machine translation neural network comprises: determining a first amount of information of the internal representation of the fully-connected processing layer relative to the target text vector, and determining a second amount of information and a third amount of information of the internal representation of the addition-normalization layer and the feedforward layer in the fully-connected processing layer relative to the target text vector, respectively; determining to delete at least a portion of the fully-connected processing layers based on the first, second, and third amounts of information.
According to some embodiments of the disclosure, the target text vector is one of: the input text vector; a translated text vector of the input text vector, wherein the input text vector corresponds to a first language and the translated text vector corresponds to a second language different from the first language.
According to another aspect of the present disclosure, there is also provided a neural network internal representation information amount determination apparatus including: an internal representation unit configured to process an input text vector by using the neural network and extract an internal representation generated by a feature processing layer in the neural network; a probability unit configured to perform fitting processing on a target text vector and the internal representation by using a probe decoder to obtain a probability value, wherein the probability value represents a probability of mapping from the internal representation to the target text vector; and an information amount calculation unit configured to determine an amount of information of the internal representation with respect to the target text vector based on the probability value.
According to some embodiments of the present disclosure, the probe decoder includes a self-attention processing layer, an encoding-decoding attention processing layer, and a fully-connected processing layer.
According to some embodiments of the present disclosure, the neural network is a machine translation neural network comprising an encoder network and a decoder network, the decoder network comprising at least one decoder comprising a self-attention processing layer, an encoding-decoding attention processing layer and a fully-connected processing layer, wherein the feature processing layer is a processing layer belonging to the decoder network.
According to some embodiments of the present disclosure, the information amount calculation unit is configured to calculate, based on the probability value, a negative log-likelihood similarity for characterizing the information amount.
According to some embodiments of the disclosure, the target text vector is a translated text vector of the input text vector, the apparatus further comprising an improvement unit configured to: changing a network structure of the machine translation neural network based on an amount of information of the internal representation relative to the target text vector, wherein the fully-connected processing layer includes an addition-normalization layer and a feed-forward layer, the improvement unit configured to: determining a first amount of information of the internal representation of the fully-connected processing layer relative to the target text vector, and determining a second amount of information and a third amount of information of the internal representation of the addition-normalization layer and the feedforward layer in the fully-connected processing layer relative to the target text vector, respectively; determining to delete at least a portion of the fully-connected processing layers based on the first, second, and third amounts of information.
According to some embodiments of the disclosure, the target text vector is one of: the input text vector; a translated text vector of the input text vector, wherein the input text vector corresponds to a first language and the translated text vector corresponds to a second language different from the first language.
According to still another aspect of the present disclosure, there is also provided a neural network internal representation information amount determination device including: a processor; a memory, wherein the memory has stored therein computer readable code, which when executed by the processor, performs a neural network internal representation information amount determination method as described above.
According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to execute the neural network internal representation information amount determination method as described above.
By using the method for determining the amount of information of an internal representation of a neural network, a probe decoder can be used to fit the target text vector to the internal representation generated by the feature processing layer in the neural network to obtain a probability value, and the amount of information of the internal representation relative to the target text vector is determined based on the probability value. The determined information amount can be used to analyze the information each network module in the neural network learns from the input vector, and further to derive the information transfer process between the modules of the neural network.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present disclosure; other drawings can be derived from them by those skilled in the art without creative effort.
FIG. 1 shows a flow diagram of a neural network internal representation information quantity determination method according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a probe decoder according to an embodiment of the present disclosure;
FIG. 3A shows a schematic block diagram of a translation network according to an embodiment of the present disclosure;
FIG. 3B illustrates a network architecture diagram of a translation network according to an embodiment of the present disclosure;
fig. 3C shows a schematic diagram of a decoder in a translation network according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of information volume variation of a fully-connected processing layer according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of the amount of information relative to an input text vector, in accordance with an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of the amount of information relative to a translated text vector, in accordance with an embodiment of the disclosure;
FIG. 7 shows a schematic block diagram of a neural network internal representation information quantity determination apparatus in accordance with an embodiment of the present disclosure;
FIG. 8 shows a schematic block diagram of a neural network internal representation information quantity determination device in accordance with an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of an architecture of an exemplary computing device, in accordance with embodiments of the present disclosure;
FIG. 10 shows a schematic diagram of a computer storage medium according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings. It is to be understood that the described embodiments are merely some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art from the disclosed embodiments without inventive effort fall within the scope of the present disclosure.
The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Likewise, the word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Flow charts are used in this disclosure to illustrate the steps of methods according to embodiments of the disclosure. It should be understood that the steps are not necessarily performed in the exact order shown; rather, various steps may be processed in reverse order or simultaneously, and other operations may be added to the processes.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Specifically, Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between a person and a computer using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Since the research in this field relates to natural language, i.e. the language that people use everyday, it is also closely related to the research of linguistics.
Neural Machine Translation (NMT) is a branch of natural language processing that implements machine translation based on a neural network, e.g., translation between different languages such as Chinese and English. For example, the neural network for machine translation may be a Transformer model having a Self-Attention based encoder-decoder framework, which is currently a commonly applied conditional sequence generation model structure.
In the process of natural language processing using a neural network, the neural network receives text vectors obtained, for example, by word embedding (Word Embedding) of words and sentences, and processes the received text vectors to obtain an output result. For example, the output result may be a translation of the word or sentence. As described above, while the neural network model maps the encoded input vector to the output through processing between neurons, the information learned by each network module in the neural network cannot be described quantitatively; that is, how each network module contributes to the output cannot be determined, which limits a deep understanding of the neural network's processing. Furthermore, improvements to neural networks are limited by the lack of knowledge of the information transfer process between the modules of the neural network.
The present disclosure provides a neural network internal representation information quantity determination method for determining an internal representation information quantity generated by a feature processing layer in a neural network. Fig. 1 shows a flow chart diagram of a neural network internal representation information amount determination method according to an embodiment of the present disclosure.
As shown in fig. 1, first, in step S101, an input text vector is processed by using a neural network, and an internal representation generated by a feature processing layer in the neural network is extracted.
According to an embodiment of the present disclosure, the input text vector may be derived from the text to be processed, and the text may be converted into a text vector by, for example, a word embedding method.
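As an illustration only (the disclosure does not prescribe an implementation), the conversion of text into a text vector via word embedding may be sketched in PyTorch as follows; the vocabulary, tokenization, and embedding dimension here are assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary built from the corpus (token -> index).
vocab = {"<pad>": 0, "I": 1, "am": 2, "a": 3, "student": 4}

# Learned lookup table mapping each token index to a 512-dimensional vector.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

tokens = ["I", "am", "a", "student"]
token_ids = torch.tensor([[vocab[t] for t in tokens]])  # shape: (1, 4)
input_text_vector = embedding(token_ids)                # shape: (1, 4, 512)
```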
According to the embodiment of the present disclosure, the neural network may be any network used to perform natural language processing as described above; the present disclosure does not limit the specific network structure of the neural network. As one example, the network may be a machine translation neural network for translating input text into other languages. As other examples, the neural network may be a network for extracting the abstract of an article, a network for extracting features of text, or the like. According to an embodiment of the present disclosure, the feature processing layer may be any processing layer in the neural network, such as a convolutional layer; specific examples are described in detail below.
According to the embodiment of the present disclosure, the internal representation is an intermediate processing result obtained by the feature processing layer based on the input text vector. It may be understood as a feature vector derived from the input text vector, which contains information of the input text vector and is passed to other processing layers in the network to obtain the final output result.
Next, as shown in fig. 1, in step S102, a probe decoder is used to perform a fitting process on the target text vector and the internal representation to obtain a probability value. According to an embodiment of the present disclosure, the fitting process derives the target text vector from the internal representation and calculates the degree of mapping between the two; the probability value represents the probability of mapping from the internal representation to the target text vector. The probe decoder implements this fitting process: it receives the internal representation and the target text vector, and by fitting them, calculates a probability value representing the degree of mapping between the internal representation and the target text vector. A specific structure of the probe decoder is described below.
Next, in step S103, an amount of information of the internal representation relative to the target text vector is determined based on the probability value.
In a method according to the present disclosure, the target text vector is the comparison object of the internal representation. According to an embodiment of the present disclosure, the amount of information of the internal representation relative to the target text vector is determined based on the probability value of mapping from the internal representation to the target text vector; i.e., the determined amount of information is a relative quantity. The higher the probability value, the easier it is to map the internal representation to the target text vector, i.e., the more information about the target text vector is contained in the internal representation.
As an example, the target text vector may be the true output vector of the input text vector, i.e., the processing result expected from the input text vector using the neural network. For example, in the case where the neural network is a machine translation neural network, the target text vector may be the translated text vector of the input text vector; e.g., when translating from Chinese, the input text may be the Chinese sentence "我是一个学生" and the corresponding translated text "I am a student". In the case where the target text vector is the true output vector, the obtained amount of information of the internal representation relative to the true output vector may be used to analyze the contribution of the intermediate processing result of the feature processing layer to the output result. The higher the amount of information of the internal representation relative to the true output vector, the greater the effect of the internal representation on obtaining the output vector, and the more it benefits obtaining the true output result.
As another example, the target text vector may be the input text vector itself, i.e., the amount of information of the internal representation relative to the input text vector is determined in step S103. In this example, the input text vector may be considered the source vector, i.e., the source data used to derive the output result. In the case where the target text vector is the input text vector, the obtained amount of information of the internal representation relative to the input text vector may be used to analyze the feature extraction effect of the intermediate processing result of the feature processing layer on the input data, and further to analyze the effect of the feature processing layer on part-of-speech tagging, semantic tagging, and the like performed on the input data.
According to an embodiment of the present disclosure, the probe decoder may include a self-attention processing layer, an encoding-decoding attention processing layer, and a fully-connected processing layer. As one example, the self-attention processing layer includes a Self-Attention neural Network (SAN), which has a neural network structure based on the self-attention mechanism. The encoding-decoding attention processing layer includes an encoding-decoding Attention neural Network (EAN), which has a neural network structure attending from the decoder to the encoder representation. The fully-connected processing layer includes a fully-connected neural network (FFN). The specific structures of the SAN, EAN, and FFN are described below.
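For orientation, the following is a minimal PyTorch sketch of one probe decoder layer with this SAN, EAN, and FFN structure. The patent publishes no reference code; the layer sizes, the use of nn.MultiheadAttention, and the residual Add & Norm placement are assumptions consistent with a standard Transformer decoder layer:

```python
import torch.nn as nn

class ProbeDecoderLayer(nn.Module):
    """Sketch of one probe-decoder layer: SAN -> EAN -> FFN (assumed sizes)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.san = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ean = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, target, internal_rep, target_mask=None):
        # SAN: self-attention over the target text vectors.
        x, _ = self.san(target, target, target, attn_mask=target_mask)
        x = self.norm1(target + x)
        # EAN: attention from the target side to the extracted internal
        # representation H of the probed feature processing layer.
        y, _ = self.ean(x, internal_rep, internal_rep)
        y = self.norm2(x + y)
        # FFN: position-wise fully-connected processing.
        return self.norm3(y + self.ffn(y))
```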
According to an embodiment of the disclosure, the determining the amount of information of the internal representation relative to the target text vector based on the probability value comprises: a Negative log-likelihood (NLL) for characterizing the information quantity is calculated based on the probability value. Determining the negative log-likelihood similarity based on the probability values may be expressed as:
NLL = -log(P)    (1)
where NLL denotes the negative log-likelihood similarity and P denotes the probability value. The probability value lies in the range [0, 1]. Since a higher probability value means that the internal representation contains more information about the target text vector, a smaller NLL computed from the probability value likewise indicates that the internal representation contains more information about the target text vector.
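As a minimal numeric illustration of equation (1) (the probability values below are invented for the example):

```python
import math

def information_amount(probability: float) -> float:
    """NLL = -log(P); valid for P in (0, 1]."""
    return -math.log(probability)

# A higher probability of mapping to the target text vector gives a
# smaller NLL, i.e. the internal representation carries more information
# about the target text vector.
print(information_amount(0.8))  # ~0.22
print(information_amount(0.1))  # ~2.30
```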
FIG. 2 shows a schematic diagram of a probe decoder according to an embodiment of the present disclosure. As shown in fig. 2, the neural network on the left receives an input text vector X = {x_1, ..., x_M1} and processes it to obtain an output text vector Y = {y_1, ..., y_M2}. Fig. 2 only schematically shows the feature processing layer in the network; the internal representation H = {h_1, ..., h_M3} obtained by the feature processing layer is supplied to the probe decoder (Prober) on the right. It should be noted that the neural network may also include other required network modules. Generally, M1 is not equal to M2, while M3 may be equal to either M1 or M2. For example, in the case where the neural network is a machine translation neural network, the input text vector may correspond to the Chinese sentence "我爱中国", and the output text vector translated by the network may be the true translated text vector "I love China"; in this example, M1 = 4 and M2 = 3. In the case where the feature processing layer belongs to the encoder network of the machine translation neural network, M3 = M1 = 4; in the case where the feature processing layer belongs to the decoder network, M3 = M2 = 3.
As shown in fig. 2, the SAN in the probe decoder receives the target text vector T = {t_1, ..., t_M2}, and the EAN receives the internal representation H = {h_1, ..., h_M3}. The probe decoder fits the received internal representation to the target text vector, i.e., measures the information of the target text vector contained in the internal representation, and obtains the final NLL loss based on the probability value so as to characterize the amount of information of the internal representation relative to the target text vector.
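Continuing the sketch above, steps S102 and S103 might then be realized as follows; the output projection, the shapes, and the use of an averaged per-token cross-entropy as -log(P) are assumptions:

```python
import torch
import torch.nn.functional as F

d_model, vocab_size = 512, 32000                    # assumed hyperparameters
prober = ProbeDecoderLayer(d_model)                 # from the sketch above
output_proj = torch.nn.Linear(d_model, vocab_size)  # hidden states -> vocabulary logits

def probe_nll(target_emb, target_ids, internal_rep):
    """Fit the target text vector T against the internal representation H
    (step S102) and return the NLL loss characterizing the amount of
    information of H relative to T (step S103)."""
    hidden = prober(target_emb, internal_rep)        # (batch, M2, d_model)
    logits = output_proj(hidden)                     # (batch, M2, vocab)
    # Average per-token -log P of mapping H to the target tokens.
    return F.cross_entropy(logits.transpose(1, 2), target_ids)
```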
According to some embodiments of the present disclosure, the neural network is a machine translation neural network for performing machine translation. For example, through training on a training corpus, the machine translation neural network can be used to translate between language pairs such as Chinese-to-English or English-to-German, e.g., translating "我爱中国" into "I love China".
In the following, the method provided by the present disclosure will be further described with a machine translation neural network as a specific application example. It will be appreciated that the method according to the present disclosure is not limited to application to the machine translation neural network.
For example, the machine translation neural network (hereinafter, simply referred to as a translation network) has a transform network structure based on a decoding-encoding architecture. The translation network includes an encoder network and a decoder network.
Fig. 3A shows a schematic block diagram of a translation network according to an embodiment of the present disclosure. As shown in fig. 3A, the encoder network includes 6 encoders for processing a received input text vector to obtain an encoded representation of the input text vector, and the decoder network includes 6 decoders for generating output text vectors. Generally, the decoder network outputs predicted text vectors one by one, in units of time steps. Specifically, the decoder network receives the encoded representation and the predicted text vector output at the previous time step to generate the predicted text vector of the current time step. For example, as shown in fig. 3A, the input text may be the Chinese sentence "我是一个学生" and the corresponding translation output "I am a student". A sketch of this time-step-wise decoding is given below.
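The following is a minimal sketch of the time-step-wise (greedy) decoding loop; encoder_net and decoder_net are assumed callables wrapping the 6-encoder and 6-decoder stacks of fig. 3A:

```python
import torch

def greedy_decode(encoder_net, decoder_net, src_ids, bos_id, eos_id, max_len=50):
    """One predicted token per time step, conditioned on the encoded
    representation and all previously predicted tokens."""
    memory = encoder_net(src_ids)              # encoded representation
    out = torch.tensor([[bos_id]])             # start-of-sentence token
    for _ in range(max_len):
        logits = decoder_net(out, memory)      # (1, t, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_id], dim=1)
        if next_id.item() == eos_id:           # stop at end-of-sentence
            break
    return out
```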
Fig. 3B illustrates a network architecture diagram of a translation network according to an embodiment of the present disclosure. First, an input sentence is word-embedded to be converted into a text vector and provided to an encoder network in a translation network. The encoder network encodes the received input text vector and obtains an encoded representation. The encoded representation may be understood as a representation of information learned by the encoder network from the input text vector.
Only one encoder is shown in fig. 3B; "N x" indicates that the encoder network has N repeated encoders, which may be connected as shown in fig. 3A, e.g., with N = 6.
Generally, each encoder has the same network structure. As shown in fig. 3B, an encoder may include a multi-head attention layer (Multi-Head Attention), an addition and normalization layer (Add & Norm), and a feed-forward layer (Feed Forward). For example, the multi-head attention layer is formed by stacking several scaled dot-product attention (Scaled Dot-Product Attention) basic units; it selects several pieces of information from the input data in parallel, so that each attention head focuses on a different part of the input data, and then concatenates the results. For example, the addition and normalization layer performs two steps: it first adds the sub-layer's input to the sub-layer's output (a residual connection), and then applies layer normalization to the sum, e.g., using the mean and the variance. A sketch of these two building blocks follows.
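The two building blocks can be sketched as follows; this is a simplified single-head version of scaled dot-product attention (the real encoder uses the multi-head variant):

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v

class AddAndNorm(nn.Module):
    """Add & Norm: residual addition followed by layer normalization
    (layer normalization standardizes with the mean and variance)."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_out):
        return self.norm(x + sublayer_out)
```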
The encoder network then provides the encoded representation as the encoder output to the decoder network, as shown in fig. 3B. The decoder network comprises at least one decoder; only one decoder is shown in fig. 3B, and "N x" indicates that the decoder network has N repeated decoders, which may be connected in the manner shown in fig. 3A, e.g., N = 6. The above numbers of encoders and decoders are only an example; the numbers of encoders and decoders in the translation network may be increased or decreased adaptively.
Generally, each decoder has the same network structure, and fig. 3C shows a schematic diagram of a decoder according to an embodiment of the present disclosure, which is the same as the structure of the decoder shown in fig. 3B. According to an embodiment of the present disclosure, a decoder includes a self-attention processing layer, an encoding-decoding attention processing layer, and a fully-connected processing layer.
As an example, as shown in fig. 3C, in the decoder, the self-attention processing layer includes a self-attention neural network (SAN), the encoding-decoding attention processing layer includes an encoding-decoding attention neural network (EAN), and the fully-connected processing layer includes a fully-connected neural network (FFN). The specific structure of SAN, EAN, and FFN included in the decoder in the translation network is the same as that of the above probe decoder.
As shown in fig. 3B and 3C, the SAN in the decoder includes a masked multi-head attention layer (Masked Multi-Head Attention) and an addition and normalization layer (Add & Norm). When the decoder network is generating the predicted text vector of time step t, the SAN in the decoder receives the predicted text vectors output at time steps 1 to (t-1) and outputs the obtained internal representation to the EAN in the decoder. The internal representation obtained by the SAN may be referred to as the SAN internal representation. The EAN in the decoder includes a multi-head attention layer (Multi-Head Attention). The EAN receives two inputs, the SAN internal representation and the encoded representation output by the encoder network, and produces the EAN internal representation. The FFN in the decoder comprises two addition and normalization layers (Add & Norm) and a feed-forward layer (Feed Forward), and an FFN internal representation is obtained based on the EAN internal representation. The decoder network further comprises a linear processing layer and a Softmax function for calculating the probability distribution of the predicted text vectors; e.g., the vector with the highest probability value may be taken as the predicted text vector of the current time step t. A sketch of this data flow is given below.
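In code form, the decoder data flow and the three internal representations that step S101 may extract can be sketched as follows; san, ean, and ffn are assumed callables implementing the sub-layers described above:

```python
def decoder_layer_with_taps(san, ean, ffn, prev_targets, encoder_out):
    """Returns the three internal representations that can be probed."""
    san_rep = san(prev_targets)           # masked self-attention (+ Add & Norm)
    ean_rep = ean(san_rep, encoder_out)   # attends to the encoded representation
    ffn_rep = ffn(ean_rep)                # feed-forward (+ Add & Norm)
    return san_rep, ean_rep, ffn_rep      # any of these can be fed to the prober
```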
According to the embodiment of the present disclosure, the feature processing layer referred to in the above step S101 may be a processing layer belonging to the decoder network. For example, the feature handling layer may be an EAN in a decoder, and the internal representation may be the EAN internal representation. In addition, the feature processing layer may also be SAN or FFN in the decoder, and correspondingly, the internal representation may be SAN internal representation or FFN internal representation. Furthermore, the feature processing layer may also be a processing layer in an encoder network, and is not limited herein.
According to an embodiment of the present disclosure, the target text vector is one of: the input text vector; a translated text vector of the input text vector, wherein the input text vector corresponds to a first language and the translated text vector corresponds to a second language different from the first language. Wherein the translated text vector is a true output vector of the input text vector.
In the case where the neural network is the translation network shown in fig. 3B and the target text vector is a translated text vector, obtaining a probability value with the probe decoder and calculating the negative log-likelihood similarity based on the probability value may be further expressed as:
NLL = -log(P_prober(y_t | y_<t; θ_prober))    (2)
where θ_prober denotes the parameters of the probe decoder, and P_prober denotes the probability value calculated by the probe decoder, i.e., the probability of predicting y_t conditioned on the target text vectors at positions 0 to (t-1) (y_<t) and the probe decoder parameters (θ_prober). The probe decoder may be trained separately on parallel corpora before it is used to calculate the amount of information.
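The separate training of the probe decoder on parallel corpora might look as follows; prober, output_proj, and probe_nll come from the sketches above, and layer_tap is a hypothetical helper that runs the frozen translation network and returns the internal representation H of the probed feature processing layer:

```python
import torch

def train_prober(frozen_nmt, corpus, layer_tap, epochs=1):
    """Fit only the prober (eq. (2)); the translation network stays frozen."""
    params = list(prober.parameters()) + list(output_proj.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs):
        for src_emb, tgt_emb, tgt_ids in corpus:   # parallel sentence pairs
            with torch.no_grad():                  # no gradients into the NMT model
                H = layer_tap(frozen_nmt, src_emb, tgt_emb)
            loss = probe_nll(tgt_emb, tgt_ids, H)  # -log P_prober(y_t | y_<t)
            opt.zero_grad()
            loss.backward()
            opt.step()
```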
According to an embodiment of the present disclosure, the target text vector is a translated text vector of the input text vector, the method further comprising: changing a network structure of the machine translation neural network based on an amount of information of the internal representation relative to the target text vector. According to an embodiment of the present disclosure, the fully-connected processing layer includes an addition normalization layer and a feed-forward layer, wherein the changing the network structure of the machine translation neural network includes: determining a first amount of information of the internal representation of the fully-connected processing layer relative to the target text vector, and determining a second amount of information and a third amount of information of the internal representation of the addition-normalization layer and the feedforward layer in the fully-connected processing layer relative to the target text vector, respectively; determining to delete at least a portion of the fully-connected processing layers based on the first, second, and third amounts of information.
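As an illustration only, the pruning decision described above could be expressed as a simple comparison of the determined information amounts (recall that a smaller NLL means more information); the tolerance threshold is an invented example value, not from the disclosure:

```python
def plan_ffn_pruning(nll_ffn, nll_add_norm, nll_feed_forward, tolerance=0.05):
    """nll_ffn / nll_add_norm / nll_feed_forward correspond to the first,
    second, and third information amounts (as NLL values)."""
    if nll_add_norm <= nll_ffn + tolerance:
        # Add & Norm alone retains as much target information as the full
        # fully-connected processing layer: the feed-forward part is a
        # candidate for deletion.
        return "delete feed-forward layer; keep one Add & Norm"
    return "keep the fully-connected processing layer unchanged"
```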
As described above, in the case where the target text vector is the true output vector (i.e., the translated text vector), the obtained amount of information of the internal representation relative to the translated text vector may be used to analyze the contribution of the internal representation of a feature processing layer (e.g., the EAN, SAN, or FFN in a decoder) to the output result. The higher the amount of information of the internal representation relative to the translated text vector, the greater the effect of the internal representation on obtaining the output vector, and the more it benefits the true output result. By analyzing how the internal representations output by the feature processing layers in the translation network evolve, the effect of each feature processing layer on the whole translation task can be obtained, and corresponding improvements can be made, such as deleting a part of the network structure that has little influence on the output result.
As an example, fig. 4 shows a schematic diagram of a variation of an information amount of a fully connected processing layer (FFN) according to an embodiment of the present disclosure to describe a process of improving a translation network based on an information amount determined according to a method of the present disclosure.
Fig. 4(a) shows the curves of the NLL, relative to the input text vector, of the internal representations obtained by the FFN (upper triangles) in the decoder of fig. 3C, by the feed-forward layer within the FFN (lower triangles), and by the Add & Norm within the FFN (squares). The horizontal axis is the layer index, with N = 6 layers as an example (i.e., the decoder network includes 6 decoders), and the vertical axis is the calculated NLL.
In general, the FFN is considered to fuse source information (from the EAN) and target information (from the SAN). Analyzing the curves in fig. 4(a), it can be seen that the internal representation resulting from the Add & Norm operation contains no less input information than the internal representation resulting from the full FFN. This indicates that the Add & Norm operation alone is sufficient to fuse the internal representations of the EAN and the SAN.
Fig. 4(b) shows the curves of the BLEU score, relative to the translated text vector, for the FFN (upper triangles) in the decoder, the feed-forward layer within the FFN (lower triangles), and the Add & Norm within the FFN (squares); the horizontal axis is the layer index, with N = 6 layers as an example (i.e., the decoder network includes 6 decoders), and the vertical axis is the calculated BLEU. Unlike NLL, BLEU is a standard metric for evaluating machine translation and characterizes translation accuracy: the higher the BLEU value, the more accurate the translation result. Analysis of the curves in fig. 4(b) reveals that the feed-forward layer in the FFN contributes less to the accuracy of the output result than the Add & Norm.
Based on this, one may attempt to delete one of the Add & Norm operations in the FFN together with at least part of the feed-forward layer, e.g., retaining only a single Add & Norm operation in the FFN, i.e., improving the network structure of the translation network based on the method according to the present disclosure.
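A sketch of such a simplified decoder layer, keeping a single Add & Norm to fuse the EAN and SAN representations and dropping the feed-forward layer, is shown below (hyperparameters assumed, as before):

```python
import torch.nn as nn

class SimplifiedDecoderLayer(nn.Module):
    """Decoder layer with the feed-forward sub-layer removed; one Add & Norm
    fuses the source (EAN) and target (SAN) information."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.san = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ean = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_san = nn.LayerNorm(d_model)
        self.norm_fuse = nn.LayerNorm(d_model)  # the single retained Add & Norm

    def forward(self, tgt, memory, tgt_mask=None):
        x, _ = self.san(tgt, tgt, tgt, attn_mask=tgt_mask)
        x = self.norm_san(tgt + x)
        y, _ = self.ean(x, memory, memory)
        return self.norm_fuse(x + y)
```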
Table 1 below records the translation accuracy (BLEU), the decoder parameters (#Param, in millions), and the decoding speed (#Speed, in sentences/minute) of the translation network before and after the improvement, on the English-German (EN-DE), English-Chinese (EN-CN), and English-French (EN-FR) test sets, respectively.
As shown in table 1, the translation accuracy of the improved translation network is substantially the same as that of the translation network before the improvement, but the parameters are significantly smaller than that of the translation network before the improvement, so that the decoding rate is greatly improved. Therefore, the improvement of the network structure of the translation network based on the information quantity is beneficial to improving the translation rate while ensuring the translation accuracy.
TABLE 1
[Table 1 appears as an image in the original publication; it lists BLEU, #Param, and #Speed for the EN-DE, EN-CN, and EN-FR test sets before and after the improvement.]
By using the method disclosed herein, the information transfer process of each module in the network relative to the input or output end can be understood from the obtained trends of the information amounts across the processing modules of the neural network, so that improvements to the network structure can be made purposefully, based on this understanding, to optimize it.
Further, fig. 5 shows a schematic diagram of the amount of information relative to the input text vector according to an embodiment of the present disclosure, and fig. 6 shows a schematic diagram of the amount of information relative to the translated text vector according to an embodiment of the present disclosure. Based on the method provided by the present disclosure, experiments were performed on several different translation data sets, including English-German (EN-DE), English-Chinese (EN-CN), and English-French (EN-FR), shown as (a), (b), and (c) in fig. 5 and fig. 6, respectively.
Specifically, fig. 5 and fig. 6 show the NLL curves of the internal representations obtained by the FFN (upper triangles), the EAN (lower triangles), and the SAN (squares) in the decoder of fig. 3C, relative to the input text vector and the translated text vector, respectively.
Based on the information amount curves characterized by NLL in fig. 5, the evolution, from lower layers to higher layers of each decoder in the translation network, of the information amounts of the FFN, EAN, and SAN with respect to the input data (i.e., the target text vector is the input text vector) can be analyzed. Likewise, based on fig. 6, their evolution with respect to the translation data (i.e., the target text vector is the translated text vector) can be analyzed.
First, it can be seen in fig. 5 and fig. 6 that the trends of the SAN and FFN information amounts are approximately the same; moreover, the SAN curve is almost the FFN curve shifted right by one layer. This indicates that the SAN has little influence on the amounts of input-side and output-side information, and, since the decoder's only information sources are the SAN and the EAN, that the EAN plays the decisive role for the input-side and output-side information amounts. In addition, the EAN information amount curve shows a clear evolution between the lower EAN layers (layers 1-3) and the higher EAN layers (layers 4-6): the lower layers acquire more and more input information, while the input information contained in the higher layers gradually decreases.
By using the method for determining the amount of information of an internal representation of a neural network, a probe decoder can be used to fit the target text vector to the internal representation generated by the feature processing layer in the neural network to obtain a probability value, and the amount of information of the internal representation relative to the target text vector is determined based on the probability value. The determined information amount can be used to analyze the information each network module in the neural network learns from the input vector, and further to derive the information transfer process between the modules of the neural network, so that the role each network module plays in the output result can be deeply understood and improvements of the network structure can be made purposefully based on this understanding.
The disclosure also provides a device for determining the quantity of information expressed inside the neural network. Fig. 7 in particular shows a schematic block diagram of a neural network internal representation information quantity determination apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus 1000 may include an internal representation unit 1010, a probability unit 1020, and an information amount calculation unit 1030.
According to an embodiment of the present disclosure, the internal representation unit 1010 may be configured to process an input text vector using the neural network and extract an internal representation generated by a feature processing layer in the neural network. The probability unit 1020 may be configured to perform a fitting process on the target text vector and the internal representation by using a probe decoder to obtain a probability value, wherein the probability value represents a probability of mapping from the internal representation to the target text vector. The information amount calculation unit 1030 may be configured to determine the amount of information of the internal representation relative to the target text vector based on the probability value.
According to some embodiments of the present disclosure, the probe decoder includes a self-attention processing layer, an encoding-decoding attention processing layer, and a full-connection processing layer.
According to some embodiments of the present disclosure, the neural network is a machine translation neural network comprising an encoder network and a decoder network, the decoder network comprising at least one decoder comprising a self-attention processing layer, an encoding-decoding attention processing layer and a fully-connected processing layer, wherein the feature processing layer is a processing layer belonging to the decoder network.
According to some embodiments of the present disclosure, the information amount calculating unit 1030 may be configured to calculate a negative log likelihood similarity for characterizing the information amount based on the probability value.
According to some embodiments of the present disclosure, the target text vector is a translated text vector of the input text vector. As shown in fig. 7, the apparatus 1000 may further include a modification unit 1040. The refinement unit 1040 may be configured to change a network structure of the machine translation neural network based on an amount of information of the internal representation relative to the target text vector.
According to an embodiment of the present disclosure, the fully-connected processing layer includes an addition normalization layer and a feed-forward layer. The improving unit 1040 may be configured to: determining a first information amount of an internal representation of the fully-connected processing layer relative to the target text vector, and determining a second information amount and a third information amount of the internal representations of the addition normalization layer and the feedforward layer in the fully-connected processing layer relative to the target text vector, respectively; determining to delete at least a portion of the fully-connected processing layers based on the first, second, and third amounts of information.
According to some embodiments of the disclosure, the target text vector is one of: the input text vector; a translated text vector of the input text vector, wherein the input text vector corresponds to a first language and the translated text vector corresponds to a second language different from the first language. For example, the first language may be Chinese, and the second language may be a language different from Chinese, such as English or German.
The steps performed by the apparatus 1000 may refer to the neural network internal representation information amount determination method according to the present disclosure described above in conjunction with the drawings, and the description is not repeated here.
According to still another aspect of the present disclosure, there is also provided a neural network internal representation information amount determination device. Fig. 8 shows a schematic block diagram of a neural network internal representation information amount determination device according to an embodiment of the present disclosure.
As shown in fig. 8, the device 2000 may include a processor 2010 and a memory 2020. The memory 2020 has stored therein computer readable code that, when executed by the processor 2010, performs the neural network internal representation information amount determination method as described above, in accordance with the disclosed embodiments.
Processor 2010 may perform various actions and processes according to programs stored in the memory 2020. In particular, the processor 2010 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, for example of the X86 architecture or the ARM architecture.
The memory 2020 stores computer-executable instruction code that, when executed by the processor 2010, implements the neural network internal representation information amount determination method according to an embodiment of the present disclosure. The memory 2020 may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM). It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
Methods or apparatuses in accordance with embodiments of the present disclosure may also be implemented by means of the architecture of the computing device 3000 shown in fig. 9. As shown in fig. 9, the computing device 3000 may include a bus 3010, one or more CPUs 3020, a read-only memory (ROM) 3030, a random access memory (RAM) 3040, a communication port 3050 for connecting to a network, input/output components 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used in the processing and/or communication of the neural network internal representation information amount determination method provided by the present disclosure, as well as program instructions executed by the CPU. The computing device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 9 is merely exemplary, and one or more components of the computing device shown in fig. 9 may be omitted as needed when implementing different devices.
According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium. Fig. 10 shows a schematic diagram 4000 of a storage medium according to the present disclosure.
As shown in fig. 10, the computer storage media 4020 has stored thereon computer readable instructions 4010. The computer readable instructions 4010, when executed by a processor, can perform the neural network internal representation information amount determination method described with reference to the above figures. The computer-readable storage medium includes, but is not limited to, volatile memory and/or non-volatile memory, for example. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. For example, the computer storage medium 4020 may be connected to a computing device such as a computer, and then, in the case where the computing device executes the computer-readable instructions 4010 stored on the computer storage medium 4020, the neural network internal representation information amount determination method as described above may be performed.
Those skilled in the art will appreciate that the present disclosure is susceptible to numerous variations and modifications. For example, the various devices or components described above may be implemented in hardware, software, firmware, or a combination of some or all of the three.
Further, while the present disclosure makes various references to certain elements of a system according to embodiments of the present disclosure, any number of different elements may be used and run on a client and/or server. The units are illustrative only, and different aspects of the systems and methods may use different units.
It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer readable storage medium, such as a read only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present disclosure is not limited to any specific form of combination of hardware and software.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present disclosure and is not to be construed as limiting thereof. Although a few exemplary embodiments of this disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The disclosure is defined by the claims and their equivalents.
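By way of illustration only, the following minimal sketch (in Python, using PyTorch) shows one way the probe-decoder measurement described above could be realized. The names ProbeDecoder and information_amount, the hyperparameters, and the use of a single standard Transformer decoder layer as the probe are assumptions made for brevity; this is a sketch of the general technique, not the implementation of this disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbeDecoder(nn.Module):
    # A single-layer probe: nn.TransformerDecoderLayer bundles the
    # self-attention, encode-decode attention, and fully-connected
    # sublayers; a linear projection maps to the target vocabulary.
    def __init__(self, d_model, nhead, vocab_size):
        super().__init__()
        self.layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, target_embeddings, internal_repr):
        # The target text vectors attend to the internal representation
        # extracted (and kept frozen) from the probed feature processing layer.
        hidden = self.layer(target_embeddings, internal_repr)
        return F.log_softmax(self.proj(hidden), dim=-1)

def information_amount(probe, target_embeddings, internal_repr, target_ids):
    # Negative log-likelihood of the target text under the probe; a lower
    # value indicates that the internal representation carries more
    # information about the target text vector.
    log_probs = probe(target_embeddings, internal_repr)
    return F.nll_loss(log_probs.transpose(1, 2), target_ids).item()

# Toy usage with random tensors standing in for real extractions.
d_model, nhead, vocab = 512, 8, 32000
probe = ProbeDecoder(d_model, nhead, vocab)
internal = torch.randn(2, 10, d_model)     # internal representation (batch, source length, d_model)
tgt_emb = torch.randn(2, 7, d_model)       # target text vectors
tgt_ids = torch.randint(0, vocab, (2, 7))  # target token ids
print(information_amount(probe, tgt_emb, internal, tgt_ids))

In practice the probe would first be fitted on pairs of extracted internal representations and target text vectors before the negative log-likelihood is read off; the random tensors above merely exercise the shapes.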

Claims (15)

1. A method for determining an amount of information of an internal representation of a neural network, the method comprising:
processing an input text vector by using the neural network, and extracting an internal representation generated by a feature processing layer in the neural network;
fitting a target text vector and the internal representation by using a probe decoder to obtain a probability value, wherein the probability value represents the probability of mapping the internal representation into the target text vector; and
determining an amount of information of the internal representation relative to the target text vector based on the probability value.
2. The method of claim 1, wherein the probe decoder comprises a self-attention processing layer, an encode-decode attention processing layer, and a fully-connected processing layer.
3. The method of claim 1, wherein the neural network is a machine translation neural network comprising an encoder network and a decoder network, the decoder network comprising at least one decoder comprising a self-attention processing layer, an encode-decode attention processing layer, and a fully-connected processing layer,
wherein the feature processing layer is a processing layer belonging to the decoder network.
4. The method of claim 1, wherein the determining an amount of information of the internal representation relative to the target text vector based on the probability value comprises:
calculating, based on the probability value, a negative log-likelihood similarity that represents the amount of information.
5. The method of claim 3, wherein the target text vector is a translated text vector of the input text vector, the method further comprising:
changing a network structure of the machine translation neural network based on an amount of information of the internal representation relative to the target text vector.
6. The method of claim 5, wherein the fully-connected processing layer comprises an add-and-normalize layer and a feed-forward layer, and wherein the changing of the network structure of the machine translation neural network comprises:
determining a first amount of information of an internal representation of the fully-connected processing layer relative to the target text vector, and respectively determining a second amount of information and a third amount of information of internal representations of the add-and-normalize layer and the feed-forward layer in the fully-connected processing layer relative to the target text vector; and
determining, based on the first, second, and third amounts of information, to delete at least a portion of the fully-connected processing layer.
7. The method of claim 3, wherein the target text vector is one of:
the input text vector;
a translated text vector of the input text vector, wherein the input text vector corresponds to a first language and the translated text vector corresponds to a second language different from the first language.
8. An apparatus for determining an amount of information of an internal representation of a neural network, the apparatus comprising:
an internal representation unit configured to process an input text vector by using the neural network and extract an internal representation generated by a feature processing layer in the neural network;
a probability unit configured to fit a target text vector and the internal representation by using a probe decoder to obtain a probability value, wherein the probability value represents the probability of mapping the internal representation into the target text vector; and
an information amount calculation unit configured to determine an amount of information of the internal representation with respect to the target text vector based on the probability value.
9. The apparatus of claim 8, wherein the probe decoder comprises a self-attention processing layer, an encode-decode attention processing layer, and a fully-connected processing layer.
10. The apparatus of claim 8, wherein the neural network is a machine translation neural network comprising an encoder network and a decoder network, the decoder network comprising at least one decoder comprising a self-attention processing layer, an encode-decode attention processing layer, and a fully-connected processing layer,
wherein the feature processing layer is a processing layer belonging to the decoder network.
11. The apparatus according to claim 8, wherein the information amount calculation unit is configured to:
calculate, based on the probability value, a negative log-likelihood similarity that represents the amount of information.
12. The apparatus of claim 10, wherein the target text vector is a translated text vector of the input text vector, the apparatus further comprising a refinement unit configured to:
changing a network structure of the machine translation neural network based on an amount of information of the internal representation relative to the target text vector, wherein the fully-connected processing layer comprises an add-and-normalize layer and a feed-forward layer, the refinement unit being further configured to:
determine a first amount of information of the internal representation of the fully-connected processing layer relative to the target text vector, and respectively determine a second amount of information and a third amount of information of internal representations of the add-and-normalize layer and the feed-forward layer in the fully-connected processing layer relative to the target text vector; and
determine, based on the first, second, and third amounts of information, to delete at least a portion of the fully-connected processing layer.
13. The apparatus of claim 10, wherein the target text vector is one of:
the input text vector;
a translated text vector of the input text vector, wherein the input text vector corresponds to a first language and the translated text vector corresponds to a second language different from the first language.
14. A device for determining an amount of information of an internal representation of a neural network, the device comprising:
a processor; and
a memory, wherein the memory stores computer-readable code which, when executed by the processor, performs the method for determining an amount of information of an internal representation of a neural network according to any one of claims 1-7.
15. A computer-readable storage medium having instructions stored thereon which, when executed by a processor, cause the processor to perform the method for determining an amount of information of an internal representation of a neural network according to any one of claims 1-7.
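Claim 6 above specifies that the pruning decision is based on comparing the three measured amounts of information, but leaves the comparison rule open. The sketch below (Python, matching the example in the description) shows one plausible decision rule under the negative-log-likelihood convention of claim 4, where a lower value means more information; the function name plan_pruning, the tolerance parameter, and the rule itself are illustrative assumptions rather than the criterion of this disclosure.

def plan_pruning(first, second, third, tolerance=0.05):
    # first:  information amount of the whole fully-connected processing layer
    # second: information amount of its add-and-normalize layer alone
    # third:  information amount of its feed-forward layer alone
    # All three are negative log-likelihoods, so lower means more informative.
    to_delete = []
    # If one sublayer alone is nearly as informative as the whole block,
    # the other sublayer contributes little and becomes a deletion candidate.
    if second - first <= tolerance:
        to_delete.append("feed_forward")
    if third - first <= tolerance:
        to_delete.append("add_and_normalize")
    return to_delete

# Example: the add-and-normalize layer alone nearly matches the whole
# block, so the feed-forward layer is flagged for deletion.
print(plan_pruning(first=2.30, second=2.32, third=3.05))  # ['feed_forward']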
CN202010151758.0A 2020-03-06 2020-03-06 Method, device, equipment and medium for determining internal representation information quantity of neural network Active CN111291576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010151758.0A CN111291576B (en) 2020-03-06 2020-03-06 Method, device, equipment and medium for determining internal representation information quantity of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010151758.0A CN111291576B (en) 2020-03-06 2020-03-06 Method, device, equipment and medium for determining internal representation information quantity of neural network

Publications (2)

Publication Number Publication Date
CN111291576A CN111291576A (en) 2020-06-16
CN111291576B (en) 2022-07-01

Family

ID=71022243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010151758.0A Active CN111291576B (en) 2020-03-06 2020-03-06 Method, device, equipment and medium for determining internal representation information quantity of neural network

Country Status (1)

Country Link
CN (1) CN111291576B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270212B * 2020-10-10 2023-12-08 深圳市凯沃尔电子有限公司 Method and device for generating a heartbeat label data sequence based on multi-lead electrocardiogram signals

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10347271B2 (en) * 2015-12-04 2019-07-09 Synaptics Incorporated Semi-supervised system for multichannel source enhancement through configurable unsupervised adaptive transformations and supervised deep neural network

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN109657135A * 2018-11-13 2019-04-19 South China University of Technology A neural-network-based method and model for extracting scholar user profile information
CN109753566A * 2019-01-09 2019-05-14 Dalian Minzu University A model training method for cross-domain sentiment analysis based on convolutional neural networks
CN110083824A * 2019-03-18 2019-08-02 Kunming University of Science and Technology A Lao word segmentation method based on a multi-model combination neural network

Non-Patent Citations (1)

Title
A method for extracting process constituent element information based on the fusion of word vectors and neural networks; Wu Lulu et al.; Journal of Nanchang University (Science Edition); 2018-06-25 (No. 03); pp. 74-82 *

Similar Documents

Publication Publication Date Title
Tan et al. Neural machine translation: A review of methods, resources, and tools
Zhang et al. A context-aware recurrent encoder for neural machine translation
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
WO2021051513A1 (en) Chinese-english translation method based on neural network, and related devices thereof
CN113590761B (en) Training method of text processing model, text processing method and related equipment
US11475225B2 (en) Method, system, electronic device and storage medium for clarification question generation
CN110807335B (en) Translation method, device, equipment and storage medium based on machine learning
CN114676234A (en) Model training method and related equipment
CN113434683B (en) Text classification method, device, medium and electronic equipment
CN112446211A (en) Text processing device, method, apparatus, and computer-readable storage medium
CN112016271A (en) Language style conversion model training method, text processing method and device
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN111291576B (en) Method, device, equipment and medium for determining internal representation information quantity of neural network
CN112528598B (en) Automatic text abstract evaluation method based on pre-training language model and information theory
CN113609873A (en) Translation model training method, device and medium
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
CN113743117A (en) Method and device for entity marking
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN115114937A (en) Text acquisition method and device, computer equipment and storage medium
CN112818688A (en) Text processing method, device, equipment and storage medium
KR102308416B1 (en) Apparatus and method for deciding video codec
CN111368526B (en) Sequence labeling method and system
CN114638231B (en) Entity linking method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40024240

Country of ref document: HK

GR01 Patent grant