CN115495550A - Chinese semantic similarity calculation method based on multi-head attention twin Bi-LSTM network - Google Patents

Chinese semantic similarity calculation method based on multi-head attention twin Bi-LSTM network

Info

Publication number
CN115495550A
Authority
CN
China
Prior art keywords
lstm
twin
similarity
chinese
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211075127.0A
Other languages
Chinese (zh)
Inventor
汪忠国
张宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Institute of Information Engineering
Original Assignee
Anhui Institute of Information Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Institute of Information Engineering filed Critical Anhui Institute of Information Engineering
Priority to CN202211075127.0A priority Critical patent/CN115495550A/en
Publication of CN115495550A publication Critical patent/CN115495550A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of information retrieval and data mining, in particular to a Chinese semantic similarity calculation method based on a multi-head attention twin Bi-LSTM network. The invention provides a Chinese text similarity calculation model, MAS-Bi-LSTM (Multi-head Attention Siamese Bi-LSTM), which uses a symmetrical Bi-LSTM structure to compute the semantic features of each text and re-weights those features through a multi-head attention mechanism. This effectively captures the semantic information among the words in a sentence, and the global feature weighting of the multi-head attention mechanism compensates for the limited global processing capability of the bidirectional RNN.

Description

Chinese semantic similarity calculation method based on multi-head attention twin Bi-LSTM network
Technical Field
The invention relates to the technical field of information retrieval and data mining, in particular to a Chinese semantic similarity calculation method based on a multi-head attention twin Bi-LSTM network.
Background
The mainstream text similarity calculation methods mainly fall into the following categories: text matching similarity calculation, text semantic similarity calculation, and deep neural network-based similarity calculation.
the surface text similarity calculation method comprises the following steps: the text matching similarity calculation method is a simple calculation method, and the method is to simply compare word sequences or character sequences in texts, and take the text matching degree or distance as a similarity judgment standard. The text matching similarity calculation method has a simple principle and strong interpretability, for example, a method based on the maximum Common subsequence (LCS) calculation of character matching calculates the similarity of texts by comparing the lengths of the maximum Common substrings of characters in two texts; the computing method based on the vector space model comprises the steps of firstly converting a One-Hot coding (One-Hot) coding or TF-IDF algorithm of a text into a text vector, and then judging the Similarity degree between two texts according to the Distance between the text vectors calculated by an equidistant formula of Manhattan space Distance (Manhattan Distance), cosine Similarity (Cosine Similarity) and Euclidean Distance (Euclidean Distance);
Semantic similarity calculation: text matching similarity calculation only considers surface matching of character sequences and completely ignores the differences between words in specific contexts. Researchers therefore proposed semantic similarity calculation methods that take the specific context into account, and text similarity calculation based on corpus training is the mainstream approach: semantic vectors are trained on a large-scale corpus, and the similarity between texts is judged by the distance between their semantic vectors. Commonly used text semantic similarity methods include LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation). Like the traditional vector space model, semantic similarity calculation represents words and documents as vectors and judges the relationship between texts through the relationship between the vectors; however, it reduces the semantic dimension through vector space mapping, which lowers the computational complexity of the model, removes part of the noise during dimension reduction, and improves the accuracy of text retrieval.
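As an illustration of corpus-based semantic similarity, the following gensim sketch builds LSA and LDA models on a toy tokenized corpus and scores a query against it; the corpus, topic number, and query are illustrative assumptions, not the setup used by the invention.

# Illustrative gensim sketch of corpus-based semantic similarity with LSA/LDA.
from gensim import corpora, models, similarities

texts = [["手机", "价格", "多少"],
         ["这", "部", "手机", "卖", "多少", "钱"],
         ["今天", "天气", "怎么样"]]

dictionary = corpora.Dictionary(texts)           # word -> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # LSA
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)  # LDA

query = dictionary.doc2bow(["手机", "多少", "钱"])
index = similarities.MatrixSimilarity(lsi[corpus])
print(index[lsi[query]])                # cosine similarity of the query to each document in LSA space
print(lda.get_document_topics(query))   # topic distribution of the query under LDA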
Text similarity calculation based on neural networks: deep neural networks analyze text at fine granularity, such as characters and words, to obtain low-dimensional vectors. A commonly used training model is the Word2vec model proposed by Mikolov et al. Word2vec has two training modes: skip-gram, which predicts the context from an input word, and CBOW, which predicts a word from its context. The CBOW mode predicts a word from the context of the input words and maps it to a low-dimensional vector, so that words with more similar semantics lie closer in the vector space; GloVe and fastText are other popular word vector generation tools. In recent years, deep neural network models such as CNN, LSTM, and Bi-LSTM have been widely applied to Chinese text similarity calculation with good results.
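The two Word2vec training modes can be illustrated with the following gensim sketch; the tiny corpus and the 300-dimensional vector size are illustrative stand-ins for real training data.

# Illustrative gensim Word2Vec sketch: sg=0 is CBOW, sg=1 is skip-gram.
from gensim.models import Word2Vec

sentences = [["手机", "价格", "多少"],
             ["这", "部", "手机", "卖", "多少", "钱"],
             ["今天", "天气", "怎么样"]]

cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)   # skip-gram

print(skipgram.wv["手机"].shape)               # 300-dimensional word vector
print(skipgram.wv.similarity("手机", "价格"))   # cosine similarity between two words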
Twin (Siamese) networks are mainly used in pattern recognition fields such as face recognition and automatic driving. Wangling et al. proposed a twin-network target tracking algorithm with a fused attention mechanism that better handles problems such as motion blur, target drift, and changing backgrounds, achieving higher accuracy and success rates. In recent years, twin deep neural networks have gradually been applied to text similarity calculation with good results. Guohao et al. proposed stacking an LSTM on a twin CNN and combining it with an attention mechanism to obtain weighted text semantic representation vectors, finally computing text similarity through cosine similarity. Zhao Chengding et al. used an asymmetric twin Bi-LSTM network to analyze the relevance of news and cases. Bao et al. used a twin LSTM network with an attention mechanism to study the similarity of Tibetan, Chinese, and English texts respectively, and found through comparison that the attention mechanism can effectively improve the performance of the twin LSTM network.
Text similarity research based on twin network models has mainly focused on English; the lack of a high-quality Chinese text similarity corpus has resulted in relatively little research on Chinese text similarity and limited reference value.
Therefore, a Chinese semantic similarity calculation method based on the multi-head attention twin Bi-LSTM network is provided.
Disclosure of Invention
The invention aims to provide a Chinese semantic similarity calculation method based on a multi-head attention twin Bi-LSTM network. In 2018, Harbin Institute of Technology published a high-quality Chinese semantic similarity corpus, LCQMC, at ACL. Based on this corpus, the invention provides a Chinese text similarity calculation model, MAS-Bi-LSTM (Multi-head Attention Siamese Bi-LSTM), which computes the semantic features of each text with a symmetrical Bi-LSTM structure and re-weights them with a multi-head attention mechanism, thereby effectively capturing the semantic information among the words in a sentence; the global feature weighting of the multi-head attention mechanism compensates for the limited global processing capability of the bidirectional RNN, solving the problems raised in the background.
In order to achieve this purpose, the invention provides the following technical scheme: a Chinese semantic similarity calculation method based on a multi-head attention twin Bi-LSTM network, in which a Chinese text similarity calculation model, MAS-Bi-LSTM (Multi-head Attention Siamese Bi-LSTM), is constructed on the Chinese semantic similarity corpus LCQMC. The model comprises an input layer, an embedding layer, a twin network layer, and a similarity calculation layer, and the method comprises the following steps:
S1: first, following word embedding model theory, obtain the word vector of each Chinese word segment from pre-trained word vectors generated with Word2Vec;
S2: second, output the weighted word vector combination of each Chinese sentence on the general corpus LCQMC with the twin Bi-LSTM network model based on the multi-head attention mechanism;
S3: finally, output the similarity value of each pair of semantic sequences through the Manhattan spatial distance algorithm.
Preferably, the input layer mainly preprocesses text a and text b and passes the result to the embedding layer. Taking text a as an example (text b is processed similarly), the input layer first performs word segmentation with the Jieba library, then removes stop words with a stop-word list, then builds a document dictionary from the segmented text, and finally pads each sequence so that all input text sequences have the same length.
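An illustrative Python sketch of this preprocessing is given below; the stop-word list, example sentences, and dictionary handling are simplified assumptions, with only Jieba segmentation, L = 200, and 0-padding taken from the description.

# Illustrative input-layer preprocessing sketch (segmentation, stop-word removal, dictionary, padding).
import jieba

L = 200
STOP_WORDS = {"的", "了", "吗", "是"}   # placeholder stop-word list

def preprocess(text: str, word2id: dict) -> list:
    tokens = [w for w in jieba.lcut(text) if w not in STOP_WORDS]   # segment and drop stop words
    for w in tokens:                                                # grow the document dictionary
        word2id.setdefault(w, len(word2id) + 1)                     # index 0 is reserved for padding
    ids = [word2id[w] for w in tokens][:L]                          # truncate sequences longer than L
    return ids + [0] * (L - len(ids))                               # pad shorter sequences with 0

word2id = {}
seq_a = preprocess("这部手机的价格是多少", word2id)
seq_b = preprocess("这个手机多少钱", word2id)
print(len(seq_a), seq_a[:8])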
Preferably, the maximum length of the text sequence is L = 200; sequences longer than L are truncated and sequences shorter than L are padded with the value 0. After preprocessing, text a can be expressed as S_a = {C_1, C_2, ..., C_L}, where L is the maximum length of the text sequence and C_i is each word segmentation result.
Preferably, the embedding layer uses the Skip-Gram model of Word2Vec to convert each C_i in the inputs S_a, S_b generated by the input layer into a word vector E_i, which serves as the input of the next (twin Bi-LSTM) layer. The Skip-Gram model predicts context words from a target word; the number of neural units in the hidden layer equals the dimension of the vector representing each word, and the output layer obtains the probability of each prediction with a softmax function. The model uses a cross-entropy loss function with gradient descent to optimize and obtain the weight matrix W. Each word vector E_i is computed by the following formula:
E_i = x_i W_{V×N}
where x_i is the one-hot encoding of word C_i based on the vocabulary index; V is the length of the one-hot encoding, i.e., the size of the vocabulary index; and N is the dimension of the word vector, which is 300 for this model.
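As a concrete illustration of the lookup E_i = x_i W_{V×N}, the following minimal numpy sketch (the vocabulary size and the random matrix W are stand-ins for the trained Skip-Gram weights) shows that multiplying a one-hot row vector by W is equivalent to selecting the corresponding row of W.

# Minimal numpy sketch of the embedding lookup E_i = x_i W.
import numpy as np

V, N = 5000, 300                    # vocabulary size and word-vector dimension (N = 300 per the text)
W = np.random.randn(V, N) * 0.01    # weight matrix learned by the Skip-Gram model (random stand-in)

def embed(word_index: int) -> np.ndarray:
    x = np.zeros(V)                 # one-hot encoding of word C_i over the vocabulary index
    x[word_index] = 1.0
    return x @ W                    # E_i = x_i W, identical to W[word_index]

E_i = embed(42)
assert np.allclose(E_i, W[42])
print(E_i.shape)                    # (300,)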
Preferably, the vector output by the twin Bi-LSTM network is represented as H = [h_1, h_2, h_3, ..., h_n]. The multi-head attention mechanism obtains a weighted summation by performing a series of operations on this vector representation, where the weights indicate the importance of each feature. The attention mechanism is divided into three steps.
Preferably, the attention mechanism comprises the following three steps:
The first step: the output h_i of the Bi-LSTM is passed to a fully connected layer to obtain the attention weight μ_i:
μ_i = tanh(W_h h_i)
where W_h is the weight coefficient of the attention model and tanh is the activation function.
The second step: normalize the weights to obtain directly usable weights α_i, with the formula:
α_i = exp(λ μ_i) / Σ_j exp(λ μ_j)
where λ is a coefficient value, and the computed α value represents the importance of each word vector in the sentence.
The third step: compute the weighted sum of the weights and the values to obtain the semantic vector S_i weighted by the attention mechanism:
S_i = Σ_i α_i h_i
where α_i is the weight of each word vector computed in the second step.
The multi-head attention mechanism repeats the attention mechanism multiple times (hence "multi-head"), with the parameters of each head not shared; the outputs S of the heads are then concatenated, and a final linear transformation produces the final output of the multi-head attention mechanism, i.e., the semantic sequence vector representation of each input sentence.
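As an illustration of the three attention steps and the unshared multi-head combination described above, the following numpy sketch uses random parameters; the head count, dimensions, and the scalar coefficient λ are illustrative assumptions rather than trained values.

# Sketch of per-head additive attention followed by the multi-head concatenation and linear transform.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_head(H, W_h, lam=1.0):
    # H: (n, d) Bi-LSTM outputs h_1..h_n; W_h: (d,) scoring vector for this head.
    mu = np.tanh(H @ W_h)            # step 1: mu_i = tanh(W_h h_i), one score per position
    alpha = softmax(lam * mu)        # step 2: normalize scores into weights alpha_i
    return alpha @ H                 # step 3: S = sum_i alpha_i h_i  -> (d,)

n, d, heads = 20, 128, 4
H = np.random.randn(n, d)                                   # twin Bi-LSTM output for one sentence
head_params = [np.random.randn(d) for _ in range(heads)]    # parameters are not shared across heads
W_o = np.random.randn(heads * d, d)                         # final linear transformation

S_concat = np.concatenate([attention_head(H, W_h) for W_h in head_params])
S = S_concat @ W_o                                          # final semantic sequence vector of the sentence
print(S.shape)                                              # (128,)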
Preferably, the output of the multi-head attention mechanism layer is the semantic sequence vectors S_a and S_b; the similarity calculation layer mainly computes the similarity of S_a and S_b in the semantic space. The invention uses the Manhattan spatial distance as the evaluation criterion to compute the similarity value of the two sentences, which lies in the range [0, 1]:
similarity = exp(-|S_a - S_b|)
Output results greater than 0.5 are considered similar and labeled 1; results of 0.5 or less are considered dissimilar and labeled 0.
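A minimal numpy sketch of this similarity layer follows; the vectors S_a and S_b are toy placeholders for the semantic vectors produced by the twin network.

# Manhattan (L1) distance mapped into (0, 1] via exp(-|S_a - S_b|), then thresholded at 0.5.
import numpy as np

def manhattan_similarity(S_a: np.ndarray, S_b: np.ndarray) -> float:
    return float(np.exp(-np.sum(np.abs(S_a - S_b))))   # exp(-||S_a - S_b||_1)

S_a = np.array([0.2, 0.5, -0.1])
S_b = np.array([0.25, 0.45, -0.05])
sim = manhattan_similarity(S_a, S_b)
label = 1 if sim > 0.5 else 0       # >0.5 -> similar (1), otherwise dissimilar (0)
print(sim, label)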
Preferably, to verify the effectiveness of the MAS-Bi-LSTM model, common CNN and RNN deep learning models are selected for comparison in the experiments: the TextCNN, GRU, Bi-GRU, and LSTM models, and the TextCNN (MA), GRU (MA), and Bi-GRU (MA) models with a multi-head attention mechanism added.
Preferably, the word embedding layer uses pre-trained word vectors based on Chinese Wikipedia, the number of heads of the multi-head attention mechanism is 4, and the distance formula is the Manhattan spatial distance formula.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a Chinese text similarity calculation model MAS-Bi-LSTM (Multi-attention Simense Bi-LSTM) based on a Multi-head attention mechanism, which utilizes a symmetrical dual-attention Bi-LSTM structure to calculate semantic features of each text, and simultaneously carries out re-weighting on the semantic features through the Multi-head attention mechanism, thereby effectively capturing semantic information among words in a sentence, and making up the deficiency of the global processing capability of a bidirectional RNN by combining with the global feature weighting of the Multi-head attention mechanism.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a diagram of MAS-Bi-LSTM model of the present invention;
fig. 2 is a diagram of an LSTM network structure of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Referring to fig. 1 to 2, the present invention provides a technical solution:
a Chinese semantic similarity calculation method based on a multi-head attention twin Bi-LSTM network is characterized in that a high-quality Chinese semantic similarity corpus LCQMC is published on an ACL based on the West university of Harbin industry in 2018. The Chinese text similarity calculation model MAS-Bi-LSTM (Multi-attribute Simase Bi-LSTM) model constructed based on the corpus comprises an input layer, an embedding layer, a twin network layer and a similarity calculation layer (shown in figure 1), and the Chinese semantic similarity calculation method based on the twin network comprises the following steps:
S1: first, following word embedding model theory, obtain the word vector of each Chinese word segment from pre-trained word vectors generated with Word2Vec;
S2: second, output the weighted word vector combination of each Chinese sentence on the general corpus LCQMC with the twin Bi-LSTM network model based on the multi-head attention mechanism;
S3: finally, output the similarity value of each pair of semantic sequences through the Manhattan spatial distance algorithm. The proposed method achieves an F1 value of 0.8076, outperforming the classical deep learning models in the comparison experiments, and can subsequently be trained on different corpora to enhance the model's adaptability to different scenarios.
The task of the MAS-Bi-LSTM model is to judge whether the semantics of two input sentences are similar; the model structure is shown in FIG. 1. As FIG. 1 shows, the MAS-Bi-LSTM model first processes the two questions to be compared, text a and text b, through Chinese word segmentation and stop-word removal; the embedding layer then converts the words into sentence vectors using word vectors pre-trained with Word2Vec; the sentence vectors pass through the symmetrical twin Bi-LSTM network with shared weight coefficients and a multi-head attention mechanism to generate the semantic representation vectors S_a and S_b; and finally the Manhattan spatial distance algorithm computes the similarity of S_a and S_b.
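By way of illustration only, the following minimal PyTorch sketch assembles the pipeline described above (shared embedding, weight-tied Bi-LSTM, multi-head additive attention, and Manhattan-distance similarity); the vocabulary size, hidden size, head count, and the nn.Linear-based attention scoring are assumptions, not the exact configuration of the invention.

# Illustrative skeleton of a twin Bi-LSTM with multi-head additive attention (not the patented model itself).
import torch
import torch.nn as nn

class MASBiLSTM(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hidden=128, heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, heads, bias=False)   # one additive-attention score per head
        self.out = nn.Linear(heads * 2 * hidden, 2 * hidden)    # final linear transform after concatenation

    def encode(self, ids):                        # ids: (batch, L) padded token indices
        H, _ = self.bilstm(self.emb(ids))         # (batch, L, 2*hidden), shared twin branch
        alpha = torch.softmax(torch.tanh(self.score(H)), dim=1)   # (batch, L, heads) attention weights
        S = torch.einsum("blh,bld->bhd", alpha, H)                # weighted sum per head
        return self.out(S.flatten(1))             # (batch, 2*hidden) semantic vector

    def forward(self, ids_a, ids_b):
        S_a, S_b = self.encode(ids_a), self.encode(ids_b)          # identical weights for both texts
        return torch.exp(-torch.sum(torch.abs(S_a - S_b), dim=1))  # Manhattan similarity in (0, 1]

model = MASBiLSTM()
a = torch.randint(1, 20000, (2, 200))
b = torch.randint(1, 20000, (2, 200))
print(model(a, b))   # similarity scores for a batch of two sentence pairs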
The input layer mainly preprocesses text a and text b and passes the result to the embedding layer. Taking text a as an example (text b is processed similarly), the input layer first performs word segmentation with the Jieba library, then removes stop words with a stop-word list, then builds a document dictionary from the segmented text, and pads each sequence so that all input text sequences have the same length.
The maximum length of the text sequence used in the invention is L = 200; sequences longer than L are truncated and sequences shorter than L are padded with the value 0. After preprocessing, text a can be expressed as S_a = {C_1, C_2, ..., C_L}, where L is the maximum length of the text sequence and C_i is each word segmentation result.
The embedding layer uses the Skip-Gram model of Word2Vec to convert each C_i in the inputs S_a, S_b generated by the input layer into a word vector E_i, which serves as the input of the next (twin Bi-LSTM) layer. The Skip-Gram model predicts context words from a target word; the number of neural units in the hidden layer equals the dimension of the vector representing each word, the output layer obtains the probability of each prediction with a softmax function, and the model uses a cross-entropy loss function with gradient descent to optimize and obtain the weight matrix W. Each word vector E_i is computed by the following formula:
E_i = x_i W_{V×N}
where x_i is the one-hot encoding of word C_i based on the vocabulary index; V is the length of the one-hot encoding, i.e., the size of the vocabulary index; and N is the dimension of the word vector, which is 300 for this model.
The twin (Siamese) neural network concept was first proposed in the field of image recognition for judging the similarity of two pictures. A twin network can also measure the similarity of two texts; the neural network used to process the texts is generally a deep learning model such as an RNN (e.g., LSTM or GRU) or a CNN.
The LSTM model adopted by the invention was first proposed by Hochreiter and Schmidhuber, mainly to address the vanishing or exploding gradient problem of the recurrent neural network (RNN); by adding several gates to the RNN it effectively solves the long-distance memory problem. The memory cell of the LSTM model includes three gating units, namely a forget gate, an input gate, and an output gate, through which the LSTM controls the transmission and selection of information, as shown in FIG. 2.
The formulas by which the LSTM updates the memory cell at time t are as follows:
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
h_t = o_t ⊙ tanh(c_t)
where x_t is the vector input from the embedding layer and h_t is the hidden state at time t; b_i, b_f, b_c, b_o are bias vectors; W_i, W_f, W_c, W_o and U_i, U_f, U_c, U_o are the weight matrices of the respective gating units; and σ is the Sigmoid function.
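The gate equations above can be traced step by step with the following numpy sketch of a single memory-cell update; the dimensions and random parameters are illustrative.

# Numpy sketch of one LSTM memory-cell update at time t, following the gate equations above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, P):
    i_t = sigmoid(P["W_i"] @ h_prev + P["U_i"] @ x_t + P["b_i"])     # input gate
    f_t = sigmoid(P["W_f"] @ h_prev + P["U_f"] @ x_t + P["b_f"])     # forget gate
    c_hat = np.tanh(P["W_c"] @ h_prev + P["U_c"] @ x_t + P["b_c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat                                 # new cell state
    o_t = sigmoid(P["W_o"] @ h_prev + P["U_o"] @ x_t + P["b_o"])     # output gate
    h_t = o_t * np.tanh(c_t)                                         # new hidden state
    return h_t, c_t

d_x, d_h = 300, 128                                                  # illustrative dimensions
P = {f"W_{g}": np.random.randn(d_h, d_h) * 0.01 for g in "ifco"}
P.update({f"U_{g}": np.random.randn(d_h, d_x) * 0.01 for g in "ifco"})
P.update({f"b_{g}": np.zeros(d_h) for g in "ifco"})

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(np.random.randn(d_x), h, c, P)
print(h.shape, c.shape)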
Bi-LSTM can better capture bidirectional semantic dependencies. Bi-LSTM performs the LSTM operation in both directions and concatenates the forward output h_t with the backward output h'_t to obtain the output S_i:
S_i = Concat(W_a h_t, W_b h'_t)
Neural networks with attention mechanisms have achieved enormous success in many NLP tasks. The attention mechanism draws on the selective mechanism of human vision: a person rapidly scans the global image, focuses attention on part of the target area, and uses limited attention to screen out valuable information.
The vector output by the twin Bi-LSTM network is represented as H = [h_1, h_2, h_3, ..., h_n]. The multi-head attention mechanism obtains a weighted summation by performing a series of operations on this vector representation, where the weights indicate the importance of each feature. The three steps of the attention mechanism are as follows:
The first step: the output h_i of the Bi-LSTM is passed to a fully connected layer to obtain the attention weight μ_i:
μ_i = tanh(W_h h_i)
where W_h is the weight coefficient of the attention model and tanh is the activation function.
The second step: normalize the weights to obtain directly usable weights α_i, with the formula:
α_i = exp(λ μ_i) / Σ_j exp(λ μ_j)
where λ is a coefficient value, and the computed α value represents the importance of each word vector in the sentence.
The third step: compute the weighted sum of the weights and the values to obtain the semantic vector S_i weighted by the attention mechanism:
S_i = Σ_i α_i h_i
where α_i is the weight of each word vector computed in the second step.
The multi-head attention mechanism repeats the attention mechanism multiple times (hence "multi-head"), with the parameters of each head not shared; the outputs S of the heads are then concatenated, and a final linear transformation produces the final output of the multi-head attention mechanism, i.e., the semantic sequence vector representation of each input sentence.
The output of the multi-head attention mechanism layer is the semantic sequence vectors S_a and S_b; the similarity calculation layer mainly computes the similarity of S_a and S_b in the semantic space. The invention uses the Manhattan spatial distance as the evaluation criterion to compute the similarity value of the two sentences, which lies in the range [0, 1]:
similarity = exp(-|S_a - S_b|)
Output results greater than 0.5 are considered similar and labeled 1; results of 0.5 or less are considered dissimilar and labeled 0.
To verify the effectiveness of the MAS-Bi-LSTM model, common CNN and RNN deep learning models are selected for comparison in the experiments: the TextCNN, GRU, Bi-GRU, and LSTM models, and the TextCNN (MA), GRU (MA), and Bi-GRU (MA) models with a multi-head attention mechanism added. The word embedding layer uses pre-trained word vectors based on Chinese Wikipedia, the number of heads of the multi-head attention mechanism is 4, and the distance formula is the Manhattan spatial distance formula; the experimental results are shown in Table 1.
TABLE 1 comparison with other models
[Table 1 appears as an image in the original filing; it reports Precision, Recall, and F1 for each compared model.]
As can be seen from Table 1, the bidirectional RNN networks Bi-GRU and Bi-LSTM perform better than the unidirectional GRU and LSTM, improving the Precision, Recall, and F1 values by 1.7%, 1.51%, and 1.61% respectively. The bidirectional RNN structure captures the dependency relationships of the words in a sentence from both directions and can better mine the internal semantic information of the text. The convolutional neural network (CNN) is weaker at capturing temporal information in text, especially long-sequence language information, and its experimental performance is clearly inferior to that of the recurrent networks. The GRU simplifies the gating mechanism of the LSTM; although this improved training speed in the experiments, it sacrificed model performance. After the feature weighting of the attention mechanism is introduced, the Precision, Recall, and F1 values of the CNN and RNN networks improve significantly, by 5.53%, 12.27%, and 8.64% on average.
The MAS-Bi-LSTM model proposed by the invention performs best among the compared models. The twin Bi-LSTM in the model has the memory capability to capture temporal information in both directions and to process long-sequence semantic information, while the global feature weighting of the multi-head attention mechanism compensates for the limited global processing capability of the Bi-LSTM; its Precision, Recall, and F1 reach the highest values among the compared models, at 0.7499, 0.8749, and 0.8076 respectively.
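For reference, Precision, Recall, and F1 values such as those reported above are computed from binary similar/dissimilar labels as in the following sketch (toy predictions, not the experimental data).

# Precision, Recall, and F1 from binary labels.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))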
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A Chinese semantic similarity calculation method based on a multi-head attention twin Bi-LSTM network, characterized in that a Chinese text similarity calculation model MAS-Bi-LSTM (Multi-head Attention Siamese Bi-LSTM) is constructed on the Chinese semantic similarity corpus LCQMC; the MAS-Bi-LSTM model comprises an input layer, an embedding layer, a twin network layer, and a similarity calculation layer, and the twin-network-based Chinese semantic similarity calculation method comprises the following steps:
S1: first, following word embedding model theory, obtain the word vector of each Chinese word segment from word vectors pre-trained with Word2Vec;
S2: second, output the weighted word vector combination of each Chinese sentence on the general corpus LCQMC with the twin Bi-LSTM network model based on the multi-head attention mechanism;
S3: finally, output the similarity value of each pair of semantic sequences through the Manhattan spatial distance algorithm.
2. The method for calculating Chinese semantic similarity based on the multi-head attention twin Bi-LSTM network according to claim 1, wherein: the input layer mainly preprocesses the input texts text a and text b, and the preprocessing result is used as the input of the embedding layer; taking text a as an example (text b is processed similarly), the input layer first performs word segmentation with the Jieba library, then removes stop words with a stop-word list, then builds a document dictionary from the segmented text, and pads each sequence so that all input text sequences have the same length.
3. The method for calculating Chinese semantic similarity based on the multi-head attention twin Bi-LSTM network according to claim 1, wherein: the maximum length of the text sequence is L = 200; sequences longer than L are truncated and sequences shorter than L are padded with the value 0; after preprocessing, text a can be expressed as S_a = {C_1, C_2, ..., C_L}, where L is the maximum length of the text sequence and C_i is each word segmentation result.
4. The method for calculating Chinese semantic similarity based on the multi-head attention twin Bi-LSTM network according to claim 3, wherein: the embedding layer uses the Skip-Gram model of Word2Vec to convert each C_i in the inputs S_a, S_b generated by the input layer into a word vector E_i, which serves as the input of the next (twin Bi-LSTM) layer; the Skip-Gram model predicts context words from a target word, the number of neural units in the hidden layer equals the dimension of the vector representing each word, the output layer obtains the probability of each prediction with a softmax function, and the model uses a cross-entropy loss function with gradient descent to optimize and obtain the weight matrix W; each word vector E_i is computed by the following formula:
E_i = x_i W_{V×N}
where x_i is the one-hot encoding of word C_i based on the vocabulary index; V is the length of the one-hot encoding, i.e., the size of the vocabulary index; and N is the dimension of the word vector, which is 300 for this model.
5. The method for calculating Chinese semantic similarity based on the multi-head attention twin Bi-LSTM network according to claim 1, wherein: the vector output by the twin Bi-LSTM network is represented as H = [h_1, h_2, h_3, ..., h_n]; the multi-head attention mechanism obtains a weighted summation by performing a series of operations on this vector representation, where the weights indicate the importance of each feature; the attention mechanism is divided into three steps.
6. The method for calculating Chinese semantic similarity based on the multi-head attention twin Bi-LSTM network according to claim 5, wherein the attention mechanism comprises the following three steps:
The first step: the output h_i of the Bi-LSTM is passed to a fully connected layer to obtain the attention weight μ_i:
μ_i = tanh(W_h h_i)
where W_h is the weight coefficient of the attention model and tanh is the activation function.
The second step: normalize the weights to obtain directly usable weights α_i, with the formula:
α_i = exp(λ μ_i) / Σ_j exp(λ μ_j)
where λ is a coefficient value, and the computed α value represents the importance of each word vector in the sentence.
The third step: compute the weighted sum of the weights and the values to obtain the semantic vector S_i weighted by the attention mechanism:
S_i = Σ_i α_i h_i
where α_i is the weight of each word vector computed in the second step.
The multi-head attention mechanism repeats the attention mechanism multiple times (hence "multi-head"), with the parameters of each head not shared; the outputs S of the heads are then concatenated, and a final linear transformation produces the final output of the multi-head attention mechanism, i.e., the semantic sequence vector representation of each input sentence.
7. The method for calculating Chinese semantic similarity based on the multi-head attention twin Bi-LSTM network according to claim 1, wherein: the output of the multi-head attention mechanism layer is the semantic sequence vectors S_a and S_b; the similarity calculation layer mainly computes the similarity of S_a and S_b in the semantic space; the Manhattan spatial distance is used as the evaluation criterion to compute the similarity value of the two sentences, which lies in the range [0, 1]:
similarity = exp(-|S_a - S_b|)
Output results greater than 0.5 are considered similar and labeled 1; results of 0.5 or less are considered dissimilar and labeled 0.
8. The method for calculating Chinese semantic similarity based on the multi-head attention twin Bi-LSTM network according to claim 1, wherein: to verify the effectiveness of the MAS-Bi-LSTM model, common CNN and RNN deep learning models are selected for comparison in the experiments, namely the TextCNN, GRU, Bi-GRU, and LSTM models and the TextCNN (MA), GRU (MA), and Bi-GRU (MA) models with a multi-head attention mechanism added.
9. The method for calculating Chinese semantic similarity based on the multi-head attention twin Bi-LSTM network according to claim 8, wherein: the word embedding layer uses pre-trained word vectors based on Chinese Wikipedia, the number of heads of the multi-head attention mechanism is 4, and the distance formula is the Manhattan spatial distance formula.
CN202211075127.0A 2022-09-03 2022-09-03 Chinese semantic similarity calculation method based on multi-head attention twin Bi-LSTM network Pending CN115495550A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211075127.0A CN115495550A (en) 2022-09-03 2022-09-03 Chinese semantic similarity calculation method based on multi-head attention twin Bi-LSTM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211075127.0A CN115495550A (en) 2022-09-03 2022-09-03 Chinese semantic similarity calculation method based on multi-head attention twin Bi-LSTM network

Publications (1)

Publication Number Publication Date
CN115495550A true CN115495550A (en) 2022-12-20

Family

ID=84468466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211075127.0A Pending CN115495550A (en) 2022-09-03 2022-09-03 Chinese semantic similarity calculation method based on multi-head attention twin Bi-LSTM network

Country Status (1)

Country Link
CN (1) CN115495550A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167353A (en) * 2023-04-26 2023-05-26 成都博智云创科技有限公司 Text semantic similarity measurement method based on twin long-term memory network
CN116881738A (en) * 2023-09-06 2023-10-13 华南理工大学 Similarity detection method of project declaration documents applied to power grid industry
CN116881738B (en) * 2023-09-06 2024-02-13 华南理工大学 Similarity detection method of project declaration documents applied to power grid industry

Similar Documents

Publication Publication Date Title
CN109299342B (en) Cross-modal retrieval method based on cycle generation type countermeasure network
Wang et al. Application of convolutional neural network in natural language processing
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
Xiang et al. A convolutional neural network-based linguistic steganalysis for synonym substitution steganography
CN115495550A (en) Chinese semantic similarity calculation method based on multi-head attention twin Bi-LSTM network
CN111414481B (en) Chinese semantic matching method based on pinyin and BERT embedding
CN111309971A (en) Multi-level coding-based text-to-video cross-modal retrieval method
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
CN112232053A (en) Text similarity calculation system, method and storage medium based on multi-keyword pair matching
CN113516152B (en) Image description method based on composite image semantics
CN115544279B (en) Multi-mode emotion classification method based on cooperative attention and application thereof
Liu et al. A multi-label text classification model based on ELMo and attention
CN117574904A (en) Named entity recognition method based on contrast learning and multi-modal semantic interaction
CN114298055B (en) Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
CN117539999A (en) Cross-modal joint coding-based multi-modal emotion analysis method
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system
CN115577072A (en) Short text sentiment analysis method based on deep learning
CN115470799A (en) Text transmission and semantic understanding integrated method for network edge equipment
CN114357166A (en) Text classification method based on deep learning
Akalya devi et al. Multimodal emotion recognition framework using a decision-level fusion and feature-level fusion approach
Ruan et al. Chinese news text classification method based on attention mechanism
Chandhar et al. Deep learning model for automatic image captioning
CN116453514B (en) Multi-view-based voice keyword detection and positioning method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination