CN116662819A - Short text-oriented matching method and system

Short text-oriented matching method and system

Info

Publication number
CN116662819A
Authority
CN
China
Prior art keywords
sentence
vector
similarity
training
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211613205.8A
Other languages
Chinese (zh)
Inventor
蔡华
陈伟宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayuan Computing Technology Shanghai Co ltd
Original Assignee
Huayuan Computing Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huayuan Computing Technology Shanghai Co ltd filed Critical Huayuan Computing Technology Shanghai Co ltd
Priority to CN202211613205.8A priority Critical patent/CN116662819A/en
Publication of CN116662819A publication Critical patent/CN116662819A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short text-oriented matching method and system, comprising: obtaining a pair of texts from a corpus, wherein the texts comprise a plurality of sentences, and copying each sentence once to construct a training set; training a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence; inputting the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and fusing the word-position feature vectors with the sentence vectors to obtain final sentence vectors; and classifying the connected sentence vectors with a classification network to obtain a matching value for the input text pair. By training on positive and negative examples, the invention improves the trained model so that similar sentences obtain higher cosine similarity from the vectors it outputs, and the most similar texts are matched more accurately, more reasonably, and with higher precision.

Description

Short text-oriented matching method and system
Technical Field
The invention relates to the technical field of text matching, and in particular to a short text-oriented matching method and system.
Background
Text matching based on sentence-vector representation is a fundamental problem in natural language processing that underlies a large number of NLP tasks, such as information retrieval, question answering, duplicate question detection, dialogue systems, and machine translation. Many such tasks can be abstracted as text matching problems: web search can be abstracted as relevance matching between web pages and user queries, automatic question answering as matching between candidate answers and questions, and text deduplication as similarity matching between texts.
Sentence-vector representation has long been a popular topic in the NLP field. Once sentence vectors are obtained, the similarity of sentences can, to a certain extent, be computed or characterized via the cosine similarity between the vectors, and a good sentence-vector representation technique can match the most similar sentences in a corpus by computing sentence similarity. Before BERT, sentence vectors were generally built from word2vec-trained word embeddings combined with a pooling strategy, or, when training data was available, from textCNN/BiLSTM encoders within a matching network. In the BERT era, the [CLS] vector of a BERT model is generally used as the sentence-vector representation, by virtue of the inherent advantages of pre-trained language models.
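For illustration, a minimal sketch of the cosine-similarity computation described above (the vector values are placeholders, not data from the invention):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder sentence vectors; in practice these would be, e.g., 768-dimensional BERT outputs.
u = np.array([0.2, 0.7, 0.1])
v = np.array([0.3, 0.6, 0.2])
print(cosine_similarity(u, v))  # value in [-1, 1]; higher means more similar
```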
Conventional matching algorithms based on lexical overlap cannot solve practical problems well because they have severe limitations: 1. Lexical limitation: "taxi" and "cab" are lexically dissimilar yet denote the same vehicle, while "apple" means different things in different contexts, either a fruit or a company. 2. Structural limitation: "machine learning" and "learning machine" consist of exactly the same words but differ in meaning. 3. Knowledge limitation: a sentence pair may be unproblematic both lexically and syntactically yet still mismatch once world knowledge is considered. This means that text matching tasks cannot stay at the literal matching level; matching at the semantic level is required.
For sentence-vector representation under a pre-trained model, the [CLS] vector produced by BERT already carries a certain degree of semantic information, owing to BERT's multi-head attention mechanism. For text-similarity matching, however, token-level matching does not always correctly represent the similarity between texts, and [CLS] by itself captures semantics only to a limited degree, so additional training tasks are required to strengthen the semantic representation of [CLS].
In general, the problems faced by the prior art fall into two aspects. On the one hand, adding an extra training task on top of the [CLS] sentence-vector representation may require a large amount of data, generally sentences with some known relation to support subsequent matching, which greatly increases the cost of the data. On the other hand, [CLS] by itself contains only part of a sentence's information and cannot always fully express the semantics of the whole sentence.
Disclosure of Invention
To address the above defects, the invention provides a short text-oriented matching method and system.
To achieve the above object, the present invention provides a short text-oriented matching method, comprising:
obtaining a pair of texts from a corpus, wherein the texts comprise a plurality of sentences, and copying each sentence once to construct a training set;
training a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
inputting the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and fusing the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and classifying the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
Preferably, the similarity training between two sentences comprises:
inputting the same sentence twice into the BERT model, whose dropout layers apply different random masks, so that the two outputs form a positive example;
inputting two different sentences into the BERT model so that the two outputs form a negative example;
and calculating the similarity of the positive and negative examples respectively, and adjusting the BERT model parameters with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
Preferably, the formula for adjusting the BERT model parameters so as to increase the positive-example similarity and decrease the negative-example similarity is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
Preferably, the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
Preferably, the KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
The invention also provides a short text-oriented matching system, comprising:
an acquisition module, configured to obtain a pair of texts from a corpus, wherein the texts comprise a plurality of sentences and each sentence is copied once to construct a training set;
a training module, configured to train a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
an aggregation module, configured to input the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and to fuse the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and a matching module, configured to classify the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
Preferably, the similarity training between two sentences comprises:
inputting the same sentence twice into the BERT model, whose dropout layers apply different random masks, so that the two outputs form a positive example;
inputting two different sentences into the BERT model so that the two outputs form a negative example;
and calculating the similarity of the positive and negative examples respectively, and adjusting the BERT model parameters with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
Preferably, the formula for adjusting the BERT model parameters so as to increase the positive-example similarity and decrease the negative-example similarity is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
Preferably, the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
Preferably, the KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
Compared with the prior art, the invention has the following beneficial effect:
by training on positive and negative examples, the invention improves the vector space of the sentence vectors output by the trained model, so that similar sentences obtain higher cosine similarity from those vectors, and the most similar texts can be matched more accurately, more reasonably, and with higher precision.
Drawings
FIG. 1 is a flow chart of the short text oriented matching method of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a short text-oriented matching method, comprising:
obtaining a pair of texts from a corpus, wherein the texts comprise a plurality of sentences, and copying each sentence once to construct a training set; the training set meets the training requirement without manual labels;
training a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
inputting the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and fusing the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and classifying the connected final sentence vectors with a classification network to obtain the matching value of the input text pair; a code sketch of these last two steps is given below.
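As a concrete illustration of the fusion and classification steps, the following sketch assumes a concatenation-based fusion and a linear classification network (the text fixes neither choice), with PyTorch's TransformerEncoderLayer standing in for the Transformer encoder:

```python
import torch
import torch.nn as nn

class FusionMatcher(nn.Module):
    """Sketch of the fusion and matching steps. The fusion operator and
    classifier are assumptions: concatenation plus linear layers."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        # A standard Transformer encoder layer applied to the word embeddings.
        self.encoder = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.fuse = nn.Linear(2 * hidden, hidden)   # fuse position features with the sentence vector
        self.classifier = nn.Linear(2 * hidden, 2)  # match / no-match over the connected pair

    def sentence_repr(self, word_embs: torch.Tensor, cls_vec: torch.Tensor) -> torch.Tensor:
        pos_feats = self.encoder(word_embs).mean(dim=1)             # word-position feature vector
        return self.fuse(torch.cat([pos_feats, cls_vec], dim=-1))   # final sentence vector

    def forward(self, embs_a, cls_a, embs_b, cls_b):
        sa = self.sentence_repr(embs_a, cls_a)
        sb = self.sentence_repr(embs_b, cls_b)
        return self.classifier(torch.cat([sa, sb], dim=-1)).softmax(-1)  # matching value

# Usage with random placeholders for BERT outputs (batch=2, seq_len=16, hidden=768).
m = FusionMatcher()
e = lambda: torch.randn(2, 16, 768)
c = lambda: torch.randn(2, 768)
print(m(e(), c(), e(), c()).shape)  # torch.Size([2, 2])
```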
In this embodiment, the similarity training between two sentences is contrastive-learning training: a self-supervised method that learns general features of the dataset by letting the model discover, without labels, which data points are similar or different. The key to contrastive learning is constructing positive and negative examples, as follows:
the two outputs obtained by feeding the same sentence twice through the BERT model, whose dropout layers apply different random masks, form a positive example;
the two outputs obtained by feeding two different sentences through the BERT model form a negative example;
the similarities of the positive and negative examples are calculated respectively, and the BERT model parameters are adjusted with the aim of increasing the positive-example similarity and decreasing the negative-example similarity, as sketched below.
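A minimal sketch of this construction, assuming the HuggingFace transformers library and the bert-base-chinese checkpoint (both assumptions, not named in the text); keeping the model in train mode leaves dropout active, so two passes over the same sentence yield two different [CLS] vectors:

```python
import torch
from transformers import BertModel, BertTokenizer  # assumed dependency

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # checkpoint is an assumption
model = BertModel.from_pretrained("bert-base-chinese")
model.train()  # keep dropout active so repeated passes differ

def cls_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    return model(**inputs).last_hidden_state[:, 0]  # the [CLS] embedding

s = "如何匹配短文本"
h1, h2 = cls_vector(s), cls_vector(s)   # same sentence, different dropout masks: positive pair
h_neg = cls_vector("今天天气不错")       # a different sentence: negative example

cos = torch.nn.functional.cosine_similarity
print(cos(h1, h2), cos(h1, h_neg))  # training pushes the first up and the second down
```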
The formula for adjusting the BERT model parameters, increasing the positive-example similarity and decreasing the negative-example similarity, is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
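A compact implementation of this objective, written as a sketch under the assumption of the standard InfoNCE form with in-batch negatives, cosine similarity, and a temperature tau:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h1: torch.Tensor, h2: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: h1[i] and h2[i] are two dropout views of sentence i
    (the positive pair); h2[j] for j != i act as negatives. tau is an assumed value."""
    h1, h2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    sim = h1 @ h2.T / tau              # pairwise cosine similarities scaled by 1/tau
    labels = torch.arange(h1.size(0))  # diagonal entries are the positive pairs
    # cross_entropy gives -log(exp(sim_ii) / sum_j exp(sim_ij)) for each row,
    # i.e. the positive similarity in the numerator and negatives in the denominator.
    return F.cross_entropy(sim, labels)

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)  # calling loss.backward() in a training loop adjusts the BERT parameters
```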
Further, the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
The KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
Fine-tuning the BERT parameters through the above training yields a BERT model with stronger text-representation capability, so that the sentence vectors output by BERT perform better on text matching tasks.
The invention also provides a short text-oriented matching system, comprising:
an acquisition module, configured to obtain a pair of texts from a corpus, wherein the texts comprise a plurality of sentences and each sentence is copied once to construct a training set;
a training module, configured to train a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
an aggregation module, configured to input the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and to fuse the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and a matching module, configured to classify the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
In this embodiment, the similarity training between two sentences is contrastive-learning training: a self-supervised method that learns general features of the dataset by letting the model discover, without labels, which data points are similar or different. The key to contrastive learning is constructing positive and negative examples, as follows:
the two outputs obtained by feeding the same sentence twice through the BERT model, whose dropout layers apply different random masks, form a positive example;
the two outputs obtained by feeding two different sentences through the BERT model form a negative example;
the similarities of the positive and negative examples are calculated respectively, and the BERT model parameters are adjusted with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
Further, the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
Fine-tuning the BERT parameters through the above training yields a BERT model with stronger text-representation capability, so that the sentence vectors output by BERT perform better on text matching tasks.
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in its protection scope.

Claims (10)

1. A short text-oriented matching method, comprising:
obtaining a pair of texts from a corpus, wherein the texts comprise a plurality of sentences, and copying each sentence once to construct a training set;
training a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
inputting the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and fusing the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and classifying the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
2. The short text-oriented matching method according to claim 1, wherein the similarity training between two sentences comprises:
inputting the same sentence twice into the BERT model, whose dropout layers apply different random masks, so that the two outputs form a positive example;
inputting two different sentences into the BERT model so that the two outputs form a negative example;
and calculating the similarity of the positive and negative examples respectively, and adjusting the BERT model parameters with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
3. The short text-oriented matching method according to claim 2, wherein the formula for adjusting the BERT model parameters so as to increase the positive-example similarity and decrease the negative-example similarity is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
4. The short text-oriented matching method according to claim 3, wherein the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
5. The short text-oriented matching method according to claim 4, wherein the KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
6. A short text-oriented matching system, comprising:
an acquisition module, configured to obtain a pair of texts from a corpus, wherein the texts comprise a plurality of sentences and each sentence is copied once to construct a training set;
a training module, configured to train a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
an aggregation module, configured to input the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and to fuse the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and a matching module, configured to classify the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
7. The short text-oriented matching system of claim 6, wherein the similarity training between two of the sentences comprises:
inputting the same sentence twice into the BERT model, whose dropout layers apply different random masks, so that the two outputs form a positive example;
inputting two different sentences into the BERT model so that the two outputs form a negative example;
and calculating the similarity of the positive and negative examples respectively, and adjusting the BERT model parameters with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
8. The short text-oriented matching system of claim 7, wherein the formula for adjusting the BERT model parameters so as to increase the positive-example similarity and decrease the negative-example similarity is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
9. The short text-oriented matching system of claim 8, wherein the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
10. The short text-oriented matching system of claim 9, wherein the KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
CN202211613205.8A 2022-12-15 2022-12-15 Short text-oriented matching method and system Pending CN116662819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211613205.8A CN116662819A (en) 2022-12-15 2022-12-15 Short text-oriented matching method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211613205.8A CN116662819A (en) 2022-12-15 2022-12-15 Short text-oriented matching method and system

Publications (1)

Publication Number Publication Date
CN116662819A true CN116662819A (en) 2023-08-29

Family

ID=87717726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211613205.8A Pending CN116662819A (en) 2022-12-15 2022-12-15 Short text-oriented matching method and system

Country Status (1)

Country Link
CN (1) CN116662819A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744656A (en) * 2023-12-21 2024-03-22 湖南工商大学 Named entity identification method and system combining small sample learning and self-checking


Similar Documents

Publication Publication Date Title
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN107122413B (en) Keyword extraction method and device based on graph model
CN107451126B (en) Method and system for screening similar meaning words
CN109165291B (en) Text matching method and electronic equipment
CN106970910B (en) Keyword extraction method and device based on graph model
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110705296A (en) Chinese natural language processing tool system based on machine learning and deep learning
CN109002473B (en) Emotion analysis method based on word vectors and parts of speech
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN112883175B (en) Meteorological service interaction method and system combining pre-training model and template generation
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN114428850A (en) Text retrieval matching method and system
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN116662819A (en) Short text-oriented matching method and system
CN114239599A (en) Method, system, equipment and medium for realizing machine reading understanding
Park et al. Automatic analysis of thematic structure in written English
Ye et al. A sentiment based non-factoid question-answering framework
CN113590768B (en) Training method and device for text relevance model, question answering method and device
CN114417880A (en) Interactive intelligent question-answering method based on power grid practical training question-answering knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination