CN114492451A - Text matching method and device, electronic equipment and computer readable storage medium - Google Patents

Text matching method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN114492451A
Authority
CN
China
Prior art keywords
text
vector
input
layer
outputting
Prior art date
Legal status
Granted
Application number
CN202111580884.9A
Other languages
Chinese (zh)
Other versions
CN114492451B (en)
Inventor
吕乐宾
蒋宁
王洪斌
吴海英
权佳成
Current Assignee
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Consumer Finance Co Ltd
Priority to CN202111580884.9A
Publication of CN114492451A
Application granted
Publication of CN114492451B
Legal status: Active
Anticipated expiration

Classifications

    • G06F40/30: Handling natural language data; Semantic analysis
    • G06F16/334: Information retrieval of unstructured textual data; Querying; Query processing; Query execution
    • G06N3/044: Neural networks; Architecture; Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; Architecture; Combinations of networks
    • G06N3/08: Neural networks; Learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a text matching method, a text matching device, electronic equipment and a computer-readable storage medium. The method comprises the following steps: inputting a first text and a second text to be matched into a text matching model for text matching processing, and outputting a matching result of the first text and the second text; wherein the text matching model comprises a first interaction layer, a distribution layer and a second interaction layer; the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector; the distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector; and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting a matching result. Through the method, the accuracy of text matching can be improved.

Description

Text matching method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a text matching method and apparatus, an electronic device, and a computer-readable storage medium.
Background
The text matching task is an important research direction in Natural Language Processing (NLP), and plays an important role in tasks such as Information Retrieval (IR), Question Answering (QA), and Paraphrase Recognition (PR). Traditional text matching methods rely on predefined templates and manually extracted rules.
With the development of deep learning, deep neural networks have been widely applied to natural language processing tasks to reduce the cost and time consumed by manually extracting features. Given two pieces of text Q and D, the text matching task extracts the semantic information and similarity features present in the texts to produce a similarity value for the two texts; from the final similarity value it can be determined whether the contents of the two texts are similar descriptions.
At present, text matching suffers from insufficient accuracy.
Disclosure of Invention
In order to solve the above problems, the present application provides a text matching method, apparatus, electronic device, and computer-readable storage medium, which can improve accuracy of text matching.
In order to solve the technical problem, the application adopts a technical scheme that: there is provided a text matching method, the method comprising: inputting a first text and a second text to be matched into a text matching model for text matching processing, and outputting a matching result of the first text and the second text; wherein the text matching model comprises a first interaction layer, a distribution layer and a second interaction layer; the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector; the distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector; and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting a matching result.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a text matching apparatus including: the text matching unit is used for inputting a first text and a second text to be matched into the text matching model for text matching processing and outputting a matching result of the first text and the second text; wherein the text matching model comprises a first interaction layer, a distribution layer and a second interaction layer; the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector; the distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector; and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting a matching result.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided an electronic device comprising a processor and a memory coupled to the processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to implement the method according to the above technical solution.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the method as provided in the above solution.
In the application, a first text vector and a second text vector are output by performing cross attention learning on a first text and a second text; respectively performing representation learning on the input first text vector and the input second text vector, and outputting a third text vector and a fourth text vector; and splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector, and outputting a matching result to perform text matching.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art from these drawings without creative effort. In the drawings:
FIG. 1 is a schematic flow chart diagram of a first embodiment of a text matching method provided by the present application;
FIG. 2 is a flowchart illustrating a second embodiment of a text matching method provided by the present application;
FIG. 3 is a schematic structural diagram of an embodiment of a first interaction layer provided herein;
FIG. 4 is a schematic diagram of a structure of an embodiment of a distribution layer provided herein;
FIG. 5 is a schematic structural diagram of an embodiment of a second interaction layer provided herein;
FIG. 6 is a schematic structural diagram of an embodiment of a text matching model provided herein;
FIG. 7 is a schematic block diagram of an embodiment of a granular network provided herein;
FIG. 8 is a schematic diagram of the present application in comparison with the related art;
FIG. 9 is another schematic diagram comparing the technical solution of the present application with the related art;
FIG. 10 is another schematic diagram comparing the present embodiment with the related art;
FIG. 11 is another schematic diagram comparing the present embodiment with the related art;
FIG. 12 is a flowchart illustrating a third embodiment of a text matching method provided by the present application;
FIG. 13 is a schematic structural diagram of an embodiment of a text matching apparatus provided in the present application;
FIG. 14 is a schematic structural diagram of an embodiment of an electronic device provided in the present application;
FIG. 15 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to better understand the scheme of the embodiments of the present application, the following first introduces the related terms and concepts that may be involved in the embodiments of the present application.
Text Matching: a deep learning task that can be simply understood as calculating the similarity between two sentences; it is mainly applied to information retrieval, intelligent question answering and the like.
Convolutional Neural Network (CNN): a class of feedforward neural networks that contain convolution computations and have a deep structure, one of the representative algorithms of deep learning.
RNN (Recurrent Neural Network): an artificial neural network with a tree-like hierarchical structure in which network nodes process input information recursively in the order of their connections.
Attention (attention mechanism): a mechanism that imitates the way humans focus on important information and ignore unimportant information. It can assign different weight values to the information encoded at different time steps of an input text sequence, representing different degrees of attention paid by the model.
Representation attention (represented-attention): a form of attention that calculates the importance of the representation vector of each word in a text sentence.
Cross-attention: calculates the importance of each word in sentence A to the whole of sentence B; conversely, the importance of each word in sentence B to the whole of sentence A can also be calculated.
LSTM (Long Short-Term Memory): a long short-term memory network is a kind of neural network for processing sequence data. Compared with a general neural network, it can better handle data that varies over a sequence.
Bi-LSTM (Bi-directional Long Short-Term Memory, bidirectional Long-Term Memory network): a neural network that processes sequence data simultaneously from two directions, improved over LSTM.
Glove: a word embedding method in natural language processing.
Referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of a text matching method provided in the present application. The method comprises the following steps:
step 11: and acquiring a first text and a second text to be matched.
In some embodiments, the first text and the second text may be in the form of a question-and-answer pair, where the first text may be a question and the second text may be an answer. Alternatively, the first text may be an answer and the second text may be a question.
In some embodiments, the first text and the second text may be in the form of an information search, wherein the first text is a search text and the second text is a text to be matched.
In some embodiments, the first text and the second text may be in the form of paraphrase recognition, where the first text is a first expression text and the second text is a second expression text.
Step 12: inputting a first text and a second text to be matched into a text matching model for text matching processing, and outputting a matching result of the first text and the second text; wherein the text matching model comprises a first interaction layer, a distribution layer and a second interaction layer; the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector; the distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector; and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting a matching result.
In some embodiments, the text matching model includes a first interaction layer, a distribution layer, and a second interaction layer.
The first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector.
In some embodiments, in the first interaction layer, single-character vectors, word vectors and phrase vectors may be extracted from the first text to form a first phrase vector set, and single-character vectors, word vectors and phrase vectors may be extracted from the second text to form a second phrase vector set. The first phrase vector set and the second phrase vector set are then compared for similarity, i.e., cross attention learning is performed, to obtain and output the first text vector and the second text vector.
For example, the synonyms between the first and second phrase vector sets, the position of those synonyms in the original text vectors, and the grammatical roles of the synonyms in the text vectors, such as subject, predicate, object, attributive, adverbial or complement, can be associated with each other to determine the cross attention between the first text and the second text, thereby obtaining the first text vector and the second text vector.
The distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector.
Representation learning is performed on the input first text vector and the input second text vector so that each of them carries corresponding representation attention, from which the corresponding third text vector and fourth text vector are obtained. Here, the representation attention represents the contribution degree of a word vector in a text vector to the text vector as a whole: if the word is a subject, a predicate or an object, its contribution degree is high; if it is, for example, an auxiliary word, its contribution degree is low.
Because the representation attention represents the contribution degree of the words in the text to the text, the third text vector is obtained from the first text vector through representation learning, and each word in the third text vector carries its own contribution degree to the text.
Similarly, the second text vector is a fourth text vector obtained through representation learning, and each word in the fourth text vector has its own contribution degree to the text.
And the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating the text similarity of the fifth text vector and outputting a matching result.
Since the fifth text vector is converted from the first text and the second text, the text similarity calculated from the fifth text vector may represent a matching result between the first text and the second text.
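To make the overall data flow concrete, the following is a minimal structural sketch of the three-layer model, assuming a PyTorch implementation; the module and variable names are illustrative only and are not taken from this application.

```python
import torch.nn as nn

class TextMatchingModel(nn.Module):
    # Composes the three layers described above; the sub-modules passed in are
    # placeholders for the first interaction layer, the distribution layer and
    # the second interaction layer.
    def __init__(self, first_interaction, distribution, second_interaction):
        super().__init__()
        self.first_interaction = first_interaction
        self.distribution = distribution
        self.second_interaction = second_interaction

    def forward(self, first_text, second_text):
        # First interaction layer: cross attention learning on the two input texts
        first_vec, second_vec = self.first_interaction(first_text, second_text)
        # Distribution layer: representation learning on each text vector
        third_vec, fourth_vec = self.distribution(first_vec, second_vec)
        # Second interaction layer: splice, compute text similarity, output matching result
        return self.second_interaction(third_vec, fourth_vec)
```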
In the embodiment, a first text vector and a second text vector are output by performing cross attention learning on a first text and a second text; respectively performing representation learning on the input first text vector and the input second text vector, and outputting a third text vector and a fourth text vector; and splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector, and outputting a matching result to perform text matching.
Referring to fig. 2, fig. 2 is a schematic flowchart of a second embodiment of the text matching method provided in the present application. The method comprises the following steps:
step 21: and acquiring a first text and a second text to be matched.
In some embodiments, the first text and the second text may be in the form of a question-and-answer pair, where the first text may be a question and the second text may be an answer. Alternatively, the first text may be an answer and the second text may be a question.
In some embodiments, the first text and the second text may be in the form of an information search, wherein the first text is a search text and the second text is a text to be matched.
In some embodiments, the first text and the second text may be in the form of paraphrase recognition, where the first text is a first expression text and the second text is a second expression text.
Step 22: and inputting the first text and the second text to be matched into a text matching model for text matching processing, and outputting a matching result of the first text and the second text.
In some embodiments, the text matching model includes a first interaction layer, a distribution layer, and a second interaction layer.
The first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector.
Referring to fig. 3, the first interaction layer includes a first embedding layer, a second embedding layer, a similar matrix layer, and a processing layer.
The first embedding layer is used for carrying out word embedding processing on an input first text and outputting a first processed text.
In some embodiments, the first text may be subjected to word embedding processing, i.e., vector conversion, to obtain a corresponding first processed text. For example, the continuous bag-of-words model is used, and the first text is input into the continuous bag-of-words model, so that the continuous bag-of-words model outputs the corresponding vector. Vector conversion can also be implemented using Skip-Gram.
The second embedding layer is used for carrying out word embedding processing on the input second text and outputting a second processed text.
In some embodiments, the second text may be subjected to word embedding processing, i.e., vector conversion, to obtain a corresponding second processed text. For example, the continuous bag-of-words model is used, and the second text is input into the continuous bag-of-words model, so that the continuous bag-of-words model outputs the corresponding vector. Vector conversion can also be implemented using Skip-Gram.
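As an illustration of such word embedding, the following minimal sketch assumes the gensim library is available; the toy sentences, model parameters and variable names are illustrative, not part of this application.

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; in practice the first text and second text would be tokenized here.
sentences = [["first", "text", "to", "be", "matched"], ["second", "text", "to", "be", "matched"]]

cbow_model = Word2Vec(sentences, vector_size=100, sg=0, min_count=1)      # sg=0: continuous bag-of-words
skipgram_model = Word2Vec(sentences, vector_size=100, sg=1, min_count=1)  # sg=1: Skip-Gram

vector = cbow_model.wv["text"]  # embedding vector for one token of the processed text
```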
The similarity matrix layer is used for carrying out similarity processing on the input first processed text and the input second processed text and outputting a first weight vector and a second weight vector.
The similarity matrix layer is specifically used for determining a similarity matrix of the first processed text and the second processed text; and performing row normalization processing on the similarity matrix to obtain a first weight vector, and performing column normalization processing on the similarity matrix to obtain a second weight vector.
Because the similarity matrix is a two-dimensional matrix and is distributed in rows and columns, normalization processing is carried out on each row of the similarity matrix to obtain a weight vector corresponding to each row, and then the weight vectors corresponding to each row are summed to obtain a first weight vector. And normalizing each column of the similarity matrix to obtain a weight vector corresponding to each column, and then summing the weight vectors corresponding to each column to obtain a second weight vector.
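A rough sketch of this normalization and summation step is shown below, assuming PyTorch and using softmax as the normalization; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def attention_weights(similarity: torch.Tensor):
    # similarity: (X, Y) similarity matrix between the X words of the first
    # processed text and the Y words of the second processed text.
    # Row normalization, then summing over rows -> first weight vector (length Y).
    first_weight = F.softmax(similarity, dim=1).sum(dim=0)
    # Column normalization, then summing over columns -> second weight vector (length X).
    second_weight = F.softmax(similarity, dim=0).sum(dim=1)
    return first_weight, second_weight
```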
The processing layer is used for fusing the input second weight vector with the first processing text and outputting a first text vector, and fusing the first weight vector with the second processing text and outputting a second text vector.
The distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector.
Referring to fig. 4, the distribution layer includes a first granular network, a second granular network, a first memory network, a second memory network, a first attention layer, and a second attention layer.
The first granularity network is used for performing multi-granularity extraction on an input first text vector to obtain a plurality of different first granularity information, and splicing the first granularity information to obtain a first spliced vector.
The first granularity network is specifically used for performing feature extraction on an input first text vector by using a plurality of groups of convolution windows with different sizes to obtain a plurality of different first granularity information, and splicing the first granularity information to obtain a first spliced vector.
Because the scales corresponding to the granularity information are different, more characteristic information can be acquired.
The first memory network is used for extracting features of the input first splicing vector and outputting the extracted first feature vector.
The first attention layer is used for performing representation learning on the input first feature vector and outputting a third text vector.
The second granularity network is used for performing multi-granularity extraction on the input second text vector to obtain a plurality of different second granularity information, and splicing the second granularity information to obtain a second spliced vector.
The second granularity network is specifically used for extracting features of input second text vectors by using multiple groups of convolution windows with different sizes to obtain multiple different second granularity information, and splicing the second granularity information to obtain a second spliced vector.
The second memory network is used for extracting the features of the input second splicing vector and outputting the extracted second feature vector.
And the second attention layer is used for performing representation learning on the input second feature vector and outputting a fourth text vector.
And the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating the text similarity of the fifth text vector and outputting a matching result.
Referring to fig. 5, the second interactive layer includes a splice layer and a full link layer.
And the splicing layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector.
And the full connection layer is used for performing text similarity calculation on the input fifth text vector and outputting a matching result.
In the embodiment, a first text vector and a second text vector are output by performing cross attention learning on a first text and a second text; respectively performing representation learning on the input first text vector and the input second text vector, and outputting a third text vector and a fourth text vector; and splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector, and outputting a matching result to perform text matching.
In an application scenario, described with reference to fig. 6 and fig. 7, two texts to be matched, i.e. a first text and a second text, are first obtained. Word embedding is then performed on the first text in the first embedding layer and on the second text in the second embedding layer.
For example, an embedding lookup function is provided in the first embedding layer and the second embedding layer; the function builds a lookup matrix from pre-trained Glove word vectors and maps each word of the first text and the second text into a high-dimensional vector space to obtain the corresponding word-embedded text.
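A minimal sketch of such an embedding lookup is given below, assuming PyTorch and that a pre-trained Glove matrix has already been loaded; the file handling and variable names are illustrative.

```python
import torch
import torch.nn as nn

def embed_text(tokens, word_to_id, glove_matrix):
    # glove_matrix: (vocab_size, embed_dim) tensor of pre-trained Glove vectors;
    # word_to_id maps each token to its row index in the lookup matrix.
    embedding = nn.Embedding.from_pretrained(glove_matrix, freeze=True)
    ids = torch.tensor([word_to_id[t] for t in tokens])
    return embedding(ids)  # (sequence_length, embed_dim) word-embedded text
```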
And then calculating the similarity between each word of the two word embedded texts in the similarity matrix layer to obtain a similarity matrix, then carrying out normalization processing on the similarity matrix according to columns, and summing all the columns to obtain a second weight vector.
And respectively carrying out normalization processing on the similar matrixes according to rows and summing all the rows to obtain a first weight vector.
And then multiplying the second weight vector by the word embedded text output by the first embedded layer in the first processing layer to realize the weighting processing of the word embedded text corresponding to the first text to obtain the first text vector.
And multiplying the first weight vector by the word embedded text output by the second embedded layer at the second processing layer to realize the weighting processing of the word embedded text corresponding to the second text to obtain a second text vector.
Specifically, the similarity between each pair of words in the first text and the second text is calculated by using word vectors after word embedding is performed on the first text and the second text to obtain a similarity matrix, the similarity matrix is normalized and summed according to rows and columns to obtain attention weights of the first text and the second text respectively, and the attention weights are used for weighting the original word embedded text to obtain new texts, such as the first text vector and the second text vector. The weighted first text vector and the weighted second text vector are easier to grasp the key parts of a piece of text in the processes of information extraction and text representation.
In the text matching process, the contribution degree of words in the text to the matching task is different, and different weight information needs to be given to different words in order to better play the role of important words in the text representation process, so the application introduces an attention mechanism, and adds attention weights from the other party to the first text and the second text respectively. This cross-attention consists of the attention of the first text to the second text (Q2D) and the attention of the second text to the first text (D2Q), and each weight value in the vector indirectly represents the overall importance of each word in the text to all words in another piece of text.
The description is made in conjunction with the following:
Suppose a first text Q_x = {Q_1, …, Q_X} with text length X and a second text D_y = {D_1, …, D_Y} with text length Y are given, and let M_xy denote the similarity matrix after the interaction of the first text and the second text. The attention can then be calculated as follows:
M_xy = Linear(Q_x · D_y + bias);
A_Q2D = sum_col(σ_col(M_xy) · D_y);
A_D2Q = sum_row(σ_row(M_xy) · Q_x);
where bias denotes the bias added after the linear function, · denotes the dot-product operation between tensors, σ denotes the softmax activation function, sum(·) denotes the sum of a tensor computed along the specified axis, A_D2Q denotes the obtained attention vector of the first text, i.e. the first weight vector mentioned above, and A_Q2D denotes the obtained attention vector of the second text, i.e. the second weight vector mentioned above. The two attention vectors are then combined with the word-embedded text in the corresponding first processing layer or second processing layer to obtain a new weighted text representation, calculated as follows:
Q_ATT = Q · A_Q2D;
D_ATT = D · A_D2Q;
Q_ATT and D_ATT respectively denote the weighted word-embedded text, where Q_ATT can represent the first text vector of the above embodiment and D_ATT the second text vector of the above embodiment.
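One possible reading of these formulas is sketched below, assuming PyTorch; treating Linear(·) as an element-wise scale and bias on the similarity scores, and the choice of reduction axes, are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionLayer(nn.Module):
    def __init__(self):
        super().__init__()
        # stands in for Linear(...) + bias applied to the similarity scores
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, q, d):
        # q: (X, dim) word-embedded first text; d: (Y, dim) word-embedded second text
        m = self.scale * (q @ d.t()) + self.bias          # similarity matrix M_xy, shape (X, Y)
        a_q2d = (F.softmax(m, dim=0) @ d).sum(dim=1)      # (X,): one weight per word of the first text
        a_d2q = (F.softmax(m, dim=1).t() @ q).sum(dim=1)  # (Y,): one weight per word of the second text
        q_att = q * a_q2d.unsqueeze(-1)                   # weighted first text vector, Q_ATT
        d_att = d * a_d2q.unsqueeze(-1)                   # weighted second text vector, D_ATT
        return q_att, d_att
```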
Then, multi-granularity information extraction is performed on the first text vector by using a first granularity network, and multi-granularity information extraction is performed on the second text vector by using a second granularity network. And combining the obtained multiple pieces of granularity information.
Specifically, the following description is made with reference to fig. 7:
in fig. 7, the weighted word-embedded text is convolved in groups with three different sets of convolution windows, each set of convolution windows extracting a feature representation of a different granularity. For example, convolution window n1 and convolution window n2 are in one set, convolution window m1 and convolution window m2 are in one set, and convolution window p1 and convolution window p2 are in one set.
The triplet information of the convolution window n1 is (100, 1, 8), the triplet information of the convolution window n2 is (8, 1, 96), the triplet information of the convolution window m1 is (100, 1, 8), the triplet information of the convolution window m2 is (8, 2, 96), the triplet information of the convolution window p1 is (100, 1, 8), and the triplet information of the convolution window p2 is (8, 3, 96).
Wherein, the triplet information respectively represents: inputting the characteristic dimension, the convolution kernel size and outputting the characteristic dimension.
Specifically, the calculation formulas are given in the original filing as equation images (not reproduced here). In those formulas, i, j ∈ {1, 2, 3}; the first-layer symbols denote the tensor obtained from the original text vector after information extraction and dimensionality reduction in the first layer of the granularity network, namely the outputs of convolution window n1, convolution window p1 and convolution window m1; the second-layer symbols denote the results of granularity information extraction and dimension enlargement in the second layer, namely the outputs of convolution window n2, convolution window p2 and convolution window m2; W_multi denotes a granularity sliding window, which gradually extracts granularity information from the text sequence as the window slides; and σ denotes the ReLU activation function. The last step of the granularity network adopts the residual connection operation of the ResNeXt network: the information after feature extraction is connected with the original information in a connection layer, where concat(·) denotes the tensor splicing operation, axis denotes the axis parameter, and the resulting symbols denote the text information of each granularity in the first text vector and the second text vector after residual connection, namely the output of the corresponding connection layer. Finally, the granularity network splices all the granularity information by rows to obtain the representation tensors Q_all and D_all of the first splicing vector and the second splicing vector.
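The granularity network can be sketched as follows, assuming PyTorch; the channel sizes follow the triplets listed above (100 to 8 to 96), while the padding, cropping and residual-style concatenation details are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GranularityNetwork(nn.Module):
    def __init__(self, in_dim=100, mid_dim=8, out_dim=96):
        super().__init__()
        # Three groups of convolution windows with kernel sizes 1, 2 and 3,
        # corresponding to the window pairs (n1, n2), (m1, m2) and (p1, p2).
        self.groups = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_dim, mid_dim, kernel_size=1),
                nn.ReLU(),
                nn.Conv1d(mid_dim, out_dim, kernel_size=k, padding=k // 2),
                nn.ReLU(),
            )
            for k in (1, 2, 3)
        ])

    def forward(self, x):
        # x: (batch, seq_len, in_dim) weighted word-embedded text
        x_t = x.transpose(1, 2)                     # (batch, in_dim, seq_len)
        branches = []
        for group in self.groups:
            feat = group(x_t)[..., : x_t.size(-1)]  # crop so all lengths match
            # residual-style connection: keep the original information alongside the features
            branches.append(torch.cat([feat, x_t], dim=1))
        # splice all granularity information along the feature axis -> Q_all / D_all
        return torch.cat(branches, dim=1).transpose(1, 2)
```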
and then, inputting the multi-granularity information obtained by combining the first text vector and the second text vector into a first memory network and a second memory network respectively, performing full-text semantic learning, calculating the contribution degree of the granularity information of each word to full-text semantics to obtain a weight vector, namely representing attention, and combining the representing attention with the multi-granularity information to realize attention to different granularity information.
That is, a first stitching vector is input to a first memory network, and a second stitching vector is input to a second memory network.
The first Memory network and the second Memory network may be Bi-directional Long Short-Term Memory (Bi-LSTM).
Specifically, a Bi-LSTM network is used to perform representation learning and dimension compression on the multi-granularity text features of the spliced first splicing vector and the spliced second splicing vector, respectively. The Bi-LSTM network can express sequence features at a higher and more abstract level, so that the global information of the sequence can be better captured, rather than being limited to extracting similarity features between words or phrases. Therefore, by performing representation learning on the feature-extracted information, the Bi-LSTM network can acquire global information within each granularity and global information between granularities. Specifically, the output of the Bi-LSTM network can be expressed using the following equations:
Q_rep = Bi-LSTM(Q_all);
D_rep = Bi-LSTM(D_all).
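A minimal sketch of this step, assuming PyTorch; the hidden size and the batch-first layout are illustrative choices.

```python
import torch.nn as nn

class MemoryNetwork(nn.Module):
    # Bi-LSTM representation learning over the spliced multi-granularity features.
    def __init__(self, feature_dim, hidden_size=128):
        super().__init__()
        self.bi_lstm = nn.LSTM(feature_dim, hidden_size, bidirectional=True, batch_first=True)

    def forward(self, spliced):
        # spliced: (batch, seq_len, feature_dim), i.e. Q_all or D_all
        outputs, _ = self.bi_lstm(spliced)
        return outputs  # (batch, seq_len, 2 * hidden_size), i.e. Q_rep or D_rep
```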
where Q_rep represents the output of the first memory network, corresponding to the first feature vector, and D_rep represents the output of the second memory network, corresponding to the second feature vector. Because the phrase information extracted by the first granularity network and the second granularity network is of different importance to the matching task, and phrase information of low importance may become noise for semantic understanding, a corresponding attention layer is added after each Bi-LSTM network, namely the first attention layer and the second attention layer. The attention layer uses a representation attention mechanism: it calculates, with a fully connected layer, the importance degree of each time step of the representation information to the global information, uses the importance degree as the weight value of the representation information, and applies the weight to the representation information. In this way, phrase information of high importance can play a larger role when the two sentences interact, while the role of phrase information of low importance is suppressed. The vector weights used in the first and second attention layers are defined as follows:
(The weight formulas are given in the original filing as equation images and are not reproduced here.) In those formulas, i ∈ {1, 2, …, X} and j ∈ {1, 2, …, Y}, where X and Y respectively denote the lengths of the first feature vector and the second feature vector; the corresponding symbols denote the vector representations of the i-th and j-th time steps of the two pieces of input text; W_i and W_j denote learnable parameters; σ(·) denotes the sigmoid activation function; and the results denote the weight values of the i-th and j-th time steps of the first feature vector and the second feature vector, i.e. the vector weights.
Then, these weight values are used to weight the input first feature vector and second feature vector to obtain the final outputs Q_out and D_out. The outputs of the first and second attention layers are expressed using the following formulas:
Q_out = Q_rep_att · Q_rep;
D_out = D_rep_att · D_rep.
where Q_out denotes the output of the first attention layer, i.e. the third text vector described above, and D_out denotes the output of the second attention layer, i.e. the fourth text vector described above.
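The representation attention step can be sketched as follows, assuming PyTorch; using a single fully connected layer with a sigmoid to produce one weight per time step is the reading adopted here, and the names are illustrative.

```python
import torch
import torch.nn as nn

class RepresentationAttention(nn.Module):
    def __init__(self, rep_dim):
        super().__init__()
        self.fc = nn.Linear(rep_dim, 1)  # importance of each time step to the global information

    def forward(self, rep):
        # rep: (batch, seq_len, rep_dim), i.e. Q_rep or D_rep
        weights = torch.sigmoid(self.fc(rep))  # (batch, seq_len, 1) per-time-step weight values
        return weights * rep                   # weighted representation, i.e. Q_out or D_out
```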
The two processed text vectors are then combined in the splicing layer to obtain a fifth text vector, text similarity calculation is performed on the input fifth text vector through the fully connected layer, and the matching result is output.
Specifically, the represented third text vector and the fourth text vector are tiled and connected in the first dimension, and the tiled third text vector and the fourth text vector are input into a full connection layer of the neural network to calculate the matching score of the two text vectors. Specifically, the following formula is adopted:
Z_rep = concat(Q_out, D_out, axis = -1);
Score = σ(W · Z_rep + b).
where Q_out and D_out are the weighted representation sequences of the first text and the second text respectively, i.e. the third text vector and the fourth text vector, concat(·) denotes the splicing function, Z_rep denotes the spliced tensor, W denotes a learnable parameter, σ(·) denotes the linear activation function, and Score is the finally output matching score of the two pieces of text.
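A sketch of this scoring step, assuming PyTorch; pooling each sequence to a fixed-size vector before splicing is an assumption made here so that texts of different lengths can share one fully connected layer.

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    def __init__(self, rep_dim):
        super().__init__()
        self.fc = nn.Linear(2 * rep_dim, 1)

    def forward(self, q_out, d_out):
        # q_out: (batch, X, rep_dim) third text vector; d_out: (batch, Y, rep_dim) fourth text vector
        z_rep = torch.cat([q_out.mean(dim=1), d_out.mean(dim=1)], dim=-1)  # spliced tensor Z_rep
        return self.fc(z_rep)  # matching score of the two pieces of text
```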
The matching score at this time may indicate a degree of matching of the first text and the second text.
In other embodiments, the text matching model may be trained in the above manner, and after the matching score is obtained, the weight of the entire text matching model is updated by a loss function according to the deviation between the matching score and the actual value.
For example, during training, the loss function is calculated as follows:
Loss=max(0,margin+y′-y)。
For two given input sequences, the difference between the final correct prediction score y and the incorrect prediction score y' can be used to represent the similarity relationship between the two prediction results, and margin is a user-specified coefficient. The higher y and the lower y', i.e. the larger y - y', the better the text matching model performs; however, the score difference counts at most up to margin, and a larger difference brings no additional reward.
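For illustration, the loss can be written as the following sketch, assuming PyTorch; margin = 1.0 is an arbitrary example value.

```python
import torch

def margin_ranking_loss(correct_score, incorrect_score, margin=1.0):
    # Loss = max(0, margin + y' - y), where y is the score of the correct pair
    # and y' the score of the incorrect pair.
    return torch.clamp(margin + incorrect_score - correct_score, min=0.0)
```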
In an application scenario, the technical solution of the present application is tested, and is explained with reference to fig. 8 to fig. 11:
The experiments of the present application use Microsoft's WikiQA dataset, Stanford's SNLI dataset and the tweet-based SemEval-2016 Task 3 dataset to compare the three text matching tasks of question answering, textual entailment recognition and stance detection.
WikiQA is a publicly available open-domain question answering dataset containing 3047 questions and 29258 answers extracted from the query logs of Bing, of which 1473 sentences are labeled as answers to the corresponding questions. Each question is associated with multiple candidate answers from Wikipedia according to users' clicking behavior, giving 29258 question-answer pairs in total. The manually labeled correct answers to the questions are then taken as candidate answers, and thus 1473 sentences are labeled as correct answers. The training set contains 20K sentence pairs, the test set contains 6.1K sentence pairs, and the validation set contains 2.7K sentence pairs; query sentences contain 6.89 words on average, and document sentences contain 22.73 words on average.
The SNLI dataset is a dataset published by Stanford University for the textual entailment recognition task in natural language processing. The SNLI dataset is manually annotated and comprises 570K text pairs, with a training set of 550K, a validation set of 10K and a test set of 10K, covering three types of text pairs: entailment, contradiction and neutral. Query sentences in the SNLI dataset contain 12.85 words on average, and document sentences contain 7.41 words.
SemEval-2016 Task 3 contains two subtasks, "question-answer similarity" and "question-question similarity"; the experiments compare results on the "question-answer similarity" data. In the "question-answer similarity" task, a particular question is given and the answers are ranked according to their relevance to the question. In the stance detection task, the goal is to determine the stance toward a given (pre-selected) target, which may not be an opinion target mentioned in the original text, so a deeper understanding of and reasoning about the sentence is required of the model. In the experiments, the "external answers" are selected as the candidate answers of this task, with labels divided into three classes: Good, Potentially Useful and Bad. Query sentences of this dataset contain 39.29 words on average, and document sentences contain 36.85 words.
Fig. 8 shows the comparison of the experimental results of the mainstream models and the technical solution of the present application on the WikiQA question-answer dataset, where FMMI denotes the technical solution of the present application. Compared with a textual entailment recognition dataset, the question-answer dataset depends on the model's understanding of the semantics of the text data at multiple levels, so a model that extracts text semantics better shows a better effect on this dataset. The test results of the FMMI of the present application are superior to those of the other models on all three metrics: NDCG@3, NDCG@5 and MAP. It can be seen that, by extracting and weighting information of multiple granularities, FMMI can capture semantic information at a higher level than approaches that directly capture single-granularity similarity information between words.
Fig. 9 illustrates the experimental comparison of the mainstream models with the FMMI of the present application on the SNLI dataset, together with the training process of each model on that dataset. Compared with a question-answer dataset, the textual entailment recognition task depends both on the model's extraction of the overall text semantic information and on its acquisition of local text feature information, so models that are good at feature extraction, such as MatchPyramid, DUET and CONV_KNRM, do not perform well on the WikiQA dataset but show good experimental results on the SNLI dataset. Since FMMI can capture high-level text semantic information and can also focus on local feature information, it shows the best effect on the SNLI dataset.
FIG. 10 shows the experimental comparison of each model with the FMMI model on the semeval2016-task3 dataset, together with the training process of each model on that dataset. This dataset selects an external answer (i.e. the text to be matched is not explicitly mentioned in the original text) as the candidate target, so it is difficult for a model to judge relevance by comparing simple features of two sentences; the task instead tests the model's understanding of and reasoning about the original text, so representation-based models, such as ARC-II and MV_LSTM, may be more competent for it. The FMMI of the present application is still superior to the other matching models on this dataset, which shows that, compared with other models, FMMI can better determine the stance of a statement in the original text toward a given target in the stance detection task.
FIG. 11 shows the performance of FMMI on the WikiQA dataset when one necessary processing unit is removed: FMMI-IA, FMMI-LSTM, FMMI-RA and FMMI-Granet respectively represent the initial model with the interactive attention unit, the Bi-LSTM representation unit, the representation attention unit and the granularity network unit removed. The contribution degree of each unit to the overall model can be seen from the experimental results. The largest contribution comes from the interactive attention mechanism introduced in the front stage of the model: weighting each word of the original text effectively improves the effect of information extraction in the rear stage of the model and the text representation capability. Next is the Bi-LSTM unit used for text representation, which plays an important role in the semantic representation and extraction processes; then the representation attention, which weights the information after granularity extraction so that the model pays attention to useful phrase information; and finally the multi-granularity network, which improves the accuracy of text matching to a certain extent by acquiring more granularity information.
Referring to fig. 12, fig. 12 is a schematic flowchart of a text matching method according to a third embodiment of the present application. The method comprises the following steps:
step 121: and acquiring a first training text and a second training text.
Wherein the first training text and the second training text are marked with the similarity true value.
Step 122: and inputting the first training text and the second training text into a text matching model, and outputting a similarity output value of the first training text and the second training text, wherein the similarity output value is used as a matching result.
Step 123: and determining a loss function based on the deviation of the similarity output value and the similarity true value.
Step 124: and modifying the text matching model by using a loss function.
For example, during training, the loss function is calculated as follows:
Loss=max(0,margin+y′-y)。
For two given input sequences, the difference between the final correct prediction score y and the incorrect prediction score y' can be used to represent the similarity relationship between the two prediction results, and margin is a user-specified coefficient. The higher y and the lower y', i.e. the larger y - y', the better the text matching model performs; however, the score difference counts at most up to margin, and a larger difference brings no additional reward.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an embodiment of a text matching apparatus provided in the present application. The text matching apparatus 130 includes a text matching unit 131.
The text matching unit 131 is configured to input the first text and the second text to be matched into the text matching model for text matching processing, and output a matching result of the first text and the second text.
Wherein the text matching model comprises a first interaction layer, a distribution layer and a second interaction layer;
the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector;
the distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector;
and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting a matching result.
It can be understood that the text matching apparatus 130 is further configured to implement the method according to any of the embodiments, and please refer to any of the above technical solutions specifically, which is not described herein again.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an embodiment of an electronic device provided in the present application. The electronic device 140 comprises a processor 141 and a memory 142 coupled to the processor 141, wherein the memory 142 stores a computer program, and the processor 141 is configured to execute the computer program to implement the following method:
inputting a first text and a second text to be matched into a text matching model for text matching processing, and outputting a matching result of the first text and the second text; wherein the text matching model comprises a first interaction layer, a distribution layer and a second interaction layer; the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector; the distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector; and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting a matching result.
It can be understood that the processor 141 is further configured to execute a computer program to implement the method according to any of the above embodiments, which is specifically referred to any of the above technical solutions and is not described herein again.
Referring to fig. 15, fig. 15 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application. The computer-readable storage medium 150 stores a computer program 151, the computer program 151, when executed by a processor, implementing the method of:
inputting a first text and a second text to be matched into a text matching model for text matching processing, and outputting a matching result of the first text and the second text; wherein the text matching model comprises a first interaction layer, a distribution layer and a second interaction layer; the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector; the distribution layer is used for respectively performing representation learning on the input first text vector and the input second text vector and outputting a third text vector and a fourth text vector; and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting a matching result.
It can be understood that, when being executed by a processor, the computer program 151 is further configured to implement the method according to any of the embodiments, which is specifically referred to any of the above technical solutions and is not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above description is only intended to illustrate embodiments of the present application and is not intended to limit the scope of the present application. Any equivalent structural or process modification made on the basis of the content of the present specification and the accompanying drawings, or any direct or indirect application in other related technical fields, likewise falls within the scope of the present application.

Claims (10)

1. A method of text matching, the method comprising:
inputting a first text and a second text to be matched into a text matching model for text matching processing, and outputting a matching result of the first text and the second text;
wherein the text matching model comprises a first interaction layer, a distribution layer, and a second interaction layer;
the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector;
the distribution layer is used for performing representation learning on the input first text vector and the input second text vector, respectively, and outputting a third text vector and a fourth text vector;
and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting the matching result.
2. The method of claim 1, wherein the first interaction layer comprises: a first embedding layer, a second embedding layer, a similarity matrix layer, and a processing layer;
the first embedding layer is used for carrying out word embedding processing on the input first text and outputting a first processed text;
the second embedding layer is used for carrying out word embedding processing on the input second text and outputting a second processed text;
the similarity matrix layer is used for carrying out similarity processing on the input first processed text and the input second processed text and outputting a first weight vector and a second weight vector;
and the processing layer is used for fusing the input second weight vector with the first processed text and outputting a first text vector, and fusing the input first weight vector with the second processed text and outputting a second text vector.
3. The method according to claim 2, characterized in that the similarity matrix layer is specifically configured to:
determining a similarity matrix of the first processed text and the second processed text;
and performing row normalization processing on the similarity matrix to obtain the first weight vector, and performing column normalization processing on the similarity matrix to obtain the second weight vector.
4. The method of claim 1, wherein the distribution layer comprises: a first granular network, a second granular network, a first memory network, a second memory network, a first attention layer, and a second attention layer;
the first granular network is used for performing multi-granularity extraction on the input first text vector to obtain a plurality of pieces of different first granularity information, and splicing the pieces of first granularity information to obtain a first spliced vector;
the first memory network is used for extracting features of the input first splicing vector and outputting an extracted first feature vector;
the first attention layer is used for performing representation learning on the input first feature vector and outputting the third text vector;
the second granular network is used for performing multi-granularity extraction on the input second text vector to obtain a plurality of pieces of different second granularity information, and splicing the pieces of second granularity information to obtain a second spliced vector;
the second memory network is used for extracting features of the input second spliced vector and outputting an extracted second feature vector;
and the second attention layer is used for performing representation learning on the input second feature vector and outputting the fourth text vector.
5. The method of claim 4, wherein the first granular network is specifically configured to:
performing feature extraction on the input first text vector by using a plurality of groups of convolution windows of different sizes to obtain a plurality of pieces of different first granularity information, and splicing the pieces of first granularity information to obtain the first spliced vector;
the second granular network is specifically configured to:
and performing feature extraction on the input second text vector by using a plurality of groups of convolution windows of different sizes to obtain a plurality of pieces of different second granularity information, and splicing the pieces of second granularity information to obtain the second spliced vector.
6. The method of claim 1, wherein the second interaction layer comprises: a splicing layer and a fully connected layer;
the splicing layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector;
and the fully connected layer is used for performing text similarity calculation on the input fifth text vector and outputting the matching result.
7. The method of claim 1, further comprising:
acquiring a first training text and a second training text;
inputting the first training text and the second training text into the text matching model, and outputting a similarity output value of the first training text and the second training text, wherein the similarity output value is used as the matching result;
determining a loss function based on the deviation between the similarity output value and the similarity true value;
and correcting the text matching model by using the loss function.
8. A text matching apparatus, characterized in that the text matching apparatus comprises:
the text matching unit is used for inputting a first text and a second text to be matched into a text matching model for text matching processing and outputting a matching result of the first text and the second text;
wherein the text matching model comprises a first interaction layer, a distribution layer, and a second interaction layer;
the first interaction layer is used for performing cross attention learning on the input first text and the input second text and outputting a first text vector and a second text vector;
the distribution layer is used for performing representation learning on the input first text vector and the input second text vector, respectively, and outputting a third text vector and a fourth text vector;
and the second interaction layer is used for splicing the input third text vector and the input fourth text vector to obtain a fifth text vector, calculating text similarity of the fifth text vector and outputting the matching result.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory coupled to the processor, in which memory a computer program is stored, the processor being adapted to execute the computer program to implement the method according to any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
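As a hedged illustration of the training procedure recited in claim 7, the sketch below computes a loss from the deviation between the model's similarity output value and the similarity true value and uses it to correct the model. The mean-squared-error loss, the Adam optimizer, and the data-loader interface are assumptions made for this example; claim 7 only requires a deviation-based loss.

```python
import torch
import torch.nn as nn


def train_epoch(model: nn.Module, loader, lr: float = 1e-3) -> float:
    """One illustrative training pass: similarity output vs. similarity true value."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # loss from the deviation between output and true value
    total = 0.0
    for first_text, second_text, true_sim in loader:
        pred_sim = model(first_text, second_text)     # similarity output value
        loss = criterion(pred_sim, true_sim.float())  # deviation-based loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                              # correct the text matching model
        total += loss.item()
    return total / max(len(loader), 1)
```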
CN202111580884.9A 2021-12-22 2021-12-22 Text matching method, device, electronic equipment and computer readable storage medium Active CN114492451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111580884.9A CN114492451B (en) 2021-12-22 2021-12-22 Text matching method, device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111580884.9A CN114492451B (en) 2021-12-22 2021-12-22 Text matching method, device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114492451A true CN114492451A (en) 2022-05-13
CN114492451B CN114492451B (en) 2023-10-24

Family

ID=81494049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111580884.9A Active CN114492451B (en) 2021-12-22 2021-12-22 Text matching method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114492451B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456176A (en) * 2022-10-10 2022-12-09 延边大学 Text matching method and system based on knowledge enhancement
CN116383491A (en) * 2023-03-21 2023-07-04 北京百度网讯科技有限公司 Information recommendation method, apparatus, device, storage medium, and program product

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A text entailment relation recognition method fusing multi-granularity information
CN109815484A (en) * 2018-12-21 2019-05-28 平安科技(深圳)有限公司 Semantic similarity matching method based on cross attention mechanism, and apparatus therefor
CN109858032A (en) * 2019-02-14 2019-06-07 程淑玉 Multi-granularity sentence interaction natural language inference model fusing the Attention mechanism
CN112183085A (en) * 2020-09-11 2021-01-05 杭州远传新业科技有限公司 Machine reading understanding method and device, electronic equipment and computer storage medium
WO2021051574A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 English text sequence labelling method and system, and computer device
CN112905827A (en) * 2021-02-08 2021-06-04 中国科学技术大学 Cross-modal image-text matching method and device and computer readable storage medium
CN113191357A (en) * 2021-05-18 2021-07-30 中国石油大学(华东) Multilevel image-text matching method based on graph attention network
CN113239181A (en) * 2021-05-14 2021-08-10 廖伟智 Scientific and technological literature citation recommendation method based on deep learning
WO2021164199A1 (en) * 2020-02-20 2021-08-26 齐鲁工业大学 Multi-granularity fusion model-based intelligent semantic Chinese sentence matching method, and device

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A text entailment relation recognition method fusing multi-granularity information
CN109815484A (en) * 2018-12-21 2019-05-28 平安科技(深圳)有限公司 Semantic similarity matching method based on cross attention mechanism, and apparatus therefor
WO2020124959A1 (en) * 2018-12-21 2020-06-25 平安科技(深圳)有限公司 Semantic similarity matching method based on cross attention mechanism, and apparatus therefor
CN109858032A (en) * 2019-02-14 2019-06-07 程淑玉 Multi-granularity sentence interaction natural language inference model fusing the Attention mechanism
WO2021051574A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 English text sequence labelling method and system, and computer device
WO2021164199A1 (en) * 2020-02-20 2021-08-26 齐鲁工业大学 Multi-granularity fusion model-based intelligent semantic Chinese sentence matching method, and device
CN112183085A (en) * 2020-09-11 2021-01-05 杭州远传新业科技有限公司 Machine reading understanding method and device, electronic equipment and computer storage medium
CN112905827A (en) * 2021-02-08 2021-06-04 中国科学技术大学 Cross-modal image-text matching method and device and computer readable storage medium
CN113239181A (en) * 2021-05-14 2021-08-10 廖伟智 Scientific and technological literature citation recommendation method based on deep learning
CN113191357A (en) * 2021-05-18 2021-07-30 中国石油大学(华东) Multilevel image-text matching method based on graph attention network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
DASOU: "A Comprehensive Overview of Text Similarity/Matching, with Code: an In-Depth Article Not to Be Missed", pages 1, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/180460887> *
FIGO: "Image-Text Matching Based on a Cross-Attention Mechanism", pages 1, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/367534387> *
KUANG-HUEI LEE et al.: "Stacked Cross Attention for Image-Text Matching", Computer Vision – ECCV 2018, pages 212 *
XING XU et al.: "Cross-Modal Attention With Semantic Consistence for Image–Text Matching", IEEE Transactions on Neural Networks and Learning Systems (Volume 31, Issue 12, December 2020), pages 5412-5425 *
吕秀程: "Text Stance Analysis Based on Attention Mechanism and Multi-Task Adversarial Learning", China Master's Theses Full-text Database, Information Science and Technology Series, no. 1, pages 138-2497 *
程淑玉; 郭泽颖; 刘威; 印鉴: "Research on Multi-Granularity Sentence Interaction Natural Language Inference Fusing Attention", Journal of Chinese Computer Systems (小型微型计算机系统), no. 06, pages 1215-1220 *
陈浩: "Research on Text Matching Based on Ensemble Deep Learning", China Master's Theses Full-text Database, Information Science and Technology Series, no. 8, pages 138-777 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456176A (en) * 2022-10-10 2022-12-09 延边大学 Text matching method and system based on knowledge enhancement
CN116383491A (en) * 2023-03-21 2023-07-04 北京百度网讯科技有限公司 Information recommendation method, apparatus, device, storage medium, and program product
CN116383491B (en) * 2023-03-21 2024-05-24 北京百度网讯科技有限公司 Information recommendation method, apparatus, device, storage medium, and program product

Also Published As

Publication number Publication date
CN114492451B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN111538908B (en) Search ranking method and device, computer equipment and storage medium
CN107562792B (en) question-answer matching method based on deep learning
CN111415740B (en) Method and device for processing inquiry information, storage medium and computer equipment
CN109033068B (en) Method and device for reading and understanding based on attention mechanism and electronic equipment
CN108875074B (en) Answer selection method and device based on cross attention neural network and electronic equipment
CN109710915B (en) Method and device for generating repeated statement
CN108846077B (en) Semantic matching method, device, medium and electronic equipment for question and answer text
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN112069302B (en) Training method of conversation intention recognition model, conversation intention recognition method and device
CN110929515A (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN112580369B (en) Sentence repeating method, method and device for training sentence repeating model
CN109522561B (en) Question and sentence repeated recognition method, device and equipment and readable storage medium
CN114492451B (en) Text matching method, device, electronic equipment and computer readable storage medium
WO2019220113A1 (en) Device and method for natural language processing
WO2019106965A1 (en) Information processing device, information processing method, and program
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN109145083B (en) Candidate answer selecting method based on deep learning
CN112699215B (en) Grading prediction method and system based on capsule network and interactive attention mechanism
CN112000788B (en) Data processing method, device and computer readable storage medium
US20220383119A1 (en) Granular neural network architecture search over low-level primitives
CN117236410A (en) Trusted electronic file large language model training and reasoning method and device
CN115510226A (en) Emotion classification method based on graph neural network
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN114648032A (en) Training method and device of semantic understanding model and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant