CN116662819A - Short text-oriented matching method and system

Short text-oriented matching method and system

Info

Publication number
CN116662819A
Authority
CN
China
Prior art keywords
sentence
vector
similarity
training
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211613205.8A
Other languages
Chinese (zh)
Inventor
蔡华
陈伟宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayuan Computing Technology Shanghai Co ltd
Original Assignee
Huayuan Computing Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huayuan Computing Technology Shanghai Co ltd filed Critical Huayuan Computing Technology Shanghai Co ltd
Priority to CN202211613205.8A priority Critical patent/CN116662819A/en
Publication of CN116662819A publication Critical patent/CN116662819A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short text-oriented matching method and system, comprising: obtaining a pair of texts from a corpus, wherein the texts comprise a plurality of sentences, and copying each sentence once to construct a training set; training a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence; inputting the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and fusing the word-position feature vectors with the sentence vectors to obtain final sentence vectors; and classifying the connected sentence vectors with a classification network to obtain a matching value for the input text pair. By training on positive and negative examples, the invention improves the trained model so that similar sentences obtain higher cosine similarity from the vectors it outputs, and the most similar texts are matched more accurately, more reasonably, and with higher precision.

Description

Short text-oriented matching method and system
Technical Field
The invention relates to the technical field of text matching, and in particular to a short text-oriented matching method and system.
Background
Text matching based on sentence-vector representation is a fundamental problem in natural language processing that underlies a large number of NLP tasks, such as information retrieval, question answering, duplicate question detection, dialogue systems, and machine translation. Many such tasks can be abstracted as text matching problems: web search can be abstracted as relevance matching between web pages and user queries, automatic question answering as matching between candidate answers and questions, and text deduplication as similarity matching between texts.
Sentence-vector representation has long been a popular topic in the NLP field. Once sentence vectors are obtained, the similarity of sentences can, to a certain extent, be computed or characterized via the cosine similarity between the vectors, and a good sentence-vector representation technique can match the most similar sentences in a corpus by computing sentence similarity. Before BERT, sentence vectors were generally built from word2vec-trained word embeddings combined with a pooling strategy, or, when training data was available, from textCNN/BiLSTM encoders within a matching network. In the BERT era, the [CLS] vector of a BERT model is generally used as the sentence-vector representation, by virtue of the inherent advantages of pre-trained language models.
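For illustration, a minimal sketch of the cosine-similarity computation described above (the vector values are placeholders, not data from the invention):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder sentence vectors; in practice these would be, e.g., 768-dimensional BERT outputs.
u = np.array([0.2, 0.7, 0.1])
v = np.array([0.3, 0.6, 0.2])
print(cosine_similarity(u, v))  # value in [-1, 1]; higher means more similar
```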
Conventional matching algorithms based on lexical overlap cannot solve practical problems well because they have severe limitations: 1. Lexical limitation: "taxi" and "cab" are lexically dissimilar yet denote the same vehicle, while "apple" means different things in different contexts, either a fruit or a company. 2. Structural limitation: "machine learning" and "learning machine" consist of exactly the same words but differ in meaning. 3. Knowledge limitation: a sentence pair may be unproblematic both lexically and syntactically yet still mismatch once world knowledge is considered. This means that text matching tasks cannot stay at the literal matching level; matching at the semantic level is required.
For sentence-vector representation under a pre-trained model, the [CLS] vector produced by BERT already carries a certain degree of semantic information, owing to BERT's multi-head attention mechanism. For text-similarity matching, however, token-level matching does not always correctly represent the similarity between texts, and [CLS] by itself captures semantics only to a limited degree, so additional training tasks are required to strengthen the semantic representation of [CLS].
In general, the problems faced by the prior art fall into two aspects. On the one hand, adding an extra training task on top of the [CLS] sentence-vector representation may require a large amount of data, generally sentences with some known relation to support subsequent matching, which greatly increases the cost of the data. On the other hand, [CLS] by itself contains only part of a sentence's information and cannot always fully express the semantics of the whole sentence.
Disclosure of Invention
To address the above defects, the invention provides a short text-oriented matching method and system.
To achieve the above object, the present invention provides a short text-oriented matching method, comprising:
obtaining a pair of texts from a corpus, wherein the texts comprise a plurality of sentences, and copying each sentence once to construct a training set;
training a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
inputting the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and fusing the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and classifying the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
Preferably, the similarity training between two sentences comprises:
inputting the same sentence twice into the BERT model, whose dropout layers apply different random masks, so that the two outputs form a positive example;
inputting two different sentences into the BERT model so that the two outputs form a negative example;
and calculating the similarity of the positive and negative examples respectively, and adjusting the BERT model parameters with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
Preferably, the formula for adjusting the BERT model parameters so as to increase the positive-example similarity and decrease the negative-example similarity is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
Preferably, the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
Preferably, the KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
The invention also provides a short text-oriented matching system, comprising:
an acquisition module, configured to obtain a pair of texts from a corpus, wherein the texts comprise a plurality of sentences and each sentence is copied once to construct a training set;
a training module, configured to train a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
an aggregation module, configured to input the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and to fuse the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and a matching module, configured to classify the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
Preferably, the similarity training between two sentences comprises:
inputting the same sentence twice into the BERT model, whose dropout layers apply different random masks, so that the two outputs form a positive example;
inputting two different sentences into the BERT model so that the two outputs form a negative example;
and calculating the similarity of the positive and negative examples respectively, and adjusting the BERT model parameters with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
Preferably, the formula for adjusting the BERT model parameters so as to increase the positive-example similarity and decrease the negative-example similarity is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
Preferably, the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
Preferably, the KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
Compared with the prior art, the invention has the following beneficial effect:
by training on positive and negative examples, the invention improves the vector space of the sentence vectors output by the trained model, so that similar sentences obtain higher cosine similarity from those vectors, and the most similar texts can be matched more accurately, more reasonably, and with higher precision.
Drawings
FIG. 1 is a flow chart of the short text oriented matching method of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a short text-oriented matching method, comprising:
obtaining a pair of texts from a corpus, wherein the texts comprise a plurality of sentences, and copying each sentence once to construct a training set; the training set meets the training requirement without manual labels;
training a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
inputting the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and fusing the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and classifying the connected final sentence vectors with a classification network to obtain the matching value of the input text pair; a code sketch of these last two steps is given below.
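As a concrete illustration of the fusion and classification steps, the following sketch assumes a concatenation-based fusion and a linear classification network (the text fixes neither choice), with PyTorch's TransformerEncoderLayer standing in for the Transformer encoder:

```python
import torch
import torch.nn as nn

class FusionMatcher(nn.Module):
    """Sketch of the fusion and matching steps. The fusion operator and
    classifier are assumptions: concatenation plus linear layers."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        # A standard Transformer encoder layer applied to the word embeddings.
        self.encoder = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.fuse = nn.Linear(2 * hidden, hidden)   # fuse position features with the sentence vector
        self.classifier = nn.Linear(2 * hidden, 2)  # match / no-match over the connected pair

    def sentence_repr(self, word_embs: torch.Tensor, cls_vec: torch.Tensor) -> torch.Tensor:
        pos_feats = self.encoder(word_embs).mean(dim=1)             # word-position feature vector
        return self.fuse(torch.cat([pos_feats, cls_vec], dim=-1))   # final sentence vector

    def forward(self, embs_a, cls_a, embs_b, cls_b):
        sa = self.sentence_repr(embs_a, cls_a)
        sb = self.sentence_repr(embs_b, cls_b)
        return self.classifier(torch.cat([sa, sb], dim=-1)).softmax(-1)  # matching value

# Usage with random placeholders for BERT outputs (batch=2, seq_len=16, hidden=768).
m = FusionMatcher()
e = lambda: torch.randn(2, 16, 768)
c = lambda: torch.randn(2, 768)
print(m(e(), c(), e(), c()).shape)  # torch.Size([2, 2])
```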
In this embodiment, the similarity training between two sentences is contrastive-learning training: a self-supervised method that learns general features of the dataset by letting the model discover, without labels, which data points are similar or different. The key to contrastive learning is constructing positive and negative examples, as follows:
the two outputs obtained by feeding the same sentence twice through the BERT model, whose dropout layers apply different random masks, form a positive example;
the two outputs obtained by feeding two different sentences through the BERT model form a negative example;
the similarities of the positive and negative examples are calculated respectively, and the BERT model parameters are adjusted with the aim of increasing the positive-example similarity and decreasing the negative-example similarity, as sketched below.
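A minimal sketch of this construction, assuming the HuggingFace transformers library and the bert-base-chinese checkpoint (both assumptions, not named in the text); keeping the model in train mode leaves dropout active, so two passes over the same sentence yield two different [CLS] vectors:

```python
import torch
from transformers import BertModel, BertTokenizer  # assumed dependency

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # checkpoint is an assumption
model = BertModel.from_pretrained("bert-base-chinese")
model.train()  # keep dropout active so repeated passes differ

def cls_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    return model(**inputs).last_hidden_state[:, 0]  # the [CLS] embedding

s = "如何匹配短文本"
h1, h2 = cls_vector(s), cls_vector(s)   # same sentence, different dropout masks: positive pair
h_neg = cls_vector("今天天气不错")       # a different sentence: negative example

cos = torch.nn.functional.cosine_similarity
print(cos(h1, h2), cos(h1, h_neg))  # training pushes the first up and the second down
```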
The formula for adjusting the BERT model parameters, increasing the positive-example similarity and decreasing the negative-example similarity, is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
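A compact implementation of this objective, written as a sketch under the assumption of the standard InfoNCE form with in-batch negatives, cosine similarity, and a temperature tau:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h1: torch.Tensor, h2: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: h1[i] and h2[i] are two dropout views of sentence i
    (the positive pair); h2[j] for j != i act as negatives. tau is an assumed value."""
    h1, h2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    sim = h1 @ h2.T / tau              # pairwise cosine similarities scaled by 1/tau
    labels = torch.arange(h1.size(0))  # diagonal entries are the positive pairs
    # cross_entropy gives -log(exp(sim_ii) / sum_j exp(sim_ij)) for each row,
    # i.e. the positive similarity in the numerator and negatives in the denominator.
    return F.cross_entropy(sim, labels)

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)  # calling loss.backward() in a training loop adjusts the BERT parameters
```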
Further, the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
The KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
Fine-tuning the BERT parameters through the above training yields a BERT model with stronger text-representation capability, so that the sentence vectors output by BERT perform better on text matching tasks.
The invention also provides a short text-oriented matching system, comprising:
an acquisition module, configured to obtain a pair of texts from a corpus, wherein the texts comprise a plurality of sentences and each sentence is copied once to construct a training set;
a training module, configured to train a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
an aggregation module, configured to input the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and to fuse the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and a matching module, configured to classify the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
In this embodiment, the similarity training between two sentences is contrastive-learning training: a self-supervised method that learns general features of the dataset by letting the model discover, without labels, which data points are similar or different. The key to contrastive learning is constructing positive and negative examples, as follows:
the two outputs obtained by feeding the same sentence twice through the BERT model, whose dropout layers apply different random masks, form a positive example;
the two outputs obtained by feeding two different sentences through the BERT model form a negative example;
the similarities of the positive and negative examples are calculated respectively, and the BERT model parameters are adjusted with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
Further, the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
Fine-tuning the BERT parameters through the above training yields a BERT model with stronger text-representation capability, so that the sentence vectors output by BERT perform better on text matching tasks.
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in its protection scope.

Claims (10)

1. A short text-oriented matching method, comprising:
obtaining a pair of texts from a corpus, wherein the texts comprise a plurality of sentences, and copying each sentence once to construct a training set;
training a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
inputting the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and fusing the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and classifying the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
2. The short text-oriented matching method according to claim 1, wherein the similarity training between two sentences comprises:
inputting the same sentence twice into the BERT model, whose dropout layers apply different random masks, so that the two outputs form a positive example;
inputting two different sentences into the BERT model so that the two outputs form a negative example;
and calculating the similarity of the positive and negative examples respectively, and adjusting the BERT model parameters with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
3. The short text-oriented matching method according to claim 2, wherein the formula for adjusting the BERT model parameters so as to increase the positive-example similarity and decrease the negative-example similarity is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
4. The short text-oriented matching method according to claim 3, wherein the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
5. The short text-oriented matching method according to claim 4, wherein the KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
6. A short text-oriented matching system, comprising:
an acquisition module, configured to obtain a pair of texts from a corpus, wherein the texts comprise a plurality of sentences and each sentence is copied once to construct a training set;
a training module, configured to train a BERT model on the training set to obtain a final BERT model, wherein the training comprises similarity training between two sentences, and training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence;
an aggregation module, configured to input the word-embedding vectors into a Transformer encoder to obtain word-position feature vectors, and to fuse the word-position feature vectors with the sentence vectors to obtain final sentence vectors;
and a matching module, configured to classify the connected final sentence vectors with a classification network to obtain a matching value for the input text pair.
7. The short text-oriented matching system of claim 6, wherein the similarity training between two of the sentences comprises:
inputting the same sentence twice into the BERT model, whose dropout layers apply different random masks, so that the two outputs form a positive example;
inputting two different sentences into the BERT model so that the two outputs form a negative example;
and calculating the similarity of the positive and negative examples respectively, and adjusting the BERT model parameters with the aim of increasing the positive-example similarity and decreasing the negative-example similarity.
8. The short text-oriented matching system of claim 7, wherein the formula for adjusting the BERT model parameters so as to increase the positive-example similarity and decrease the negative-example similarity is:

$\mathcal{L} = -\log \dfrac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}}$

wherein the numerator inside the logarithm is the positive-example similarity and the denominator is the negative-example similarity.
9. The short text-oriented matching system of claim 8, wherein the training that aligns the dot product or cosine similarity between each sentence vector and each word-embedding vector in the sentence with the weights of the words in the sentence comprises:
inputting the sentence into the BERT model to obtain the sentence vector and each word-embedding vector in the sentence, and calculating the dot product or cosine similarity between the sentence vector and each word-embedding vector;
extracting the weight of each word by keyword extraction;
converting the dot-product or cosine similarities and the word weights into probability distributions through a softmax operation, and then calculating the KL divergence between the two distributions;
and adjusting the BERT model parameters with the aim of decreasing the KL divergence.
10. The short text-oriented matching system of claim 9, wherein the KL divergence formula is:

$KL(w_{key} \,\|\, w_{cls}) = \sum_{i} w_{key,i} \log \dfrac{w_{key,i}}{w_{cls,i}}$

wherein $w_{key}$ is the weight vector of the words and $w_{cls}$ is the weight vector of the sentence.
CN202211613205.8A 2022-12-15 2022-12-15 Short text-oriented matching method and system Pending CN116662819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211613205.8A CN116662819A (en) 2022-12-15 2022-12-15 Short text-oriented matching method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211613205.8A CN116662819A (en) 2022-12-15 2022-12-15 Short text-oriented matching method and system

Publications (1)

Publication Number Publication Date
CN116662819A true CN116662819A (en) 2023-08-29

Family

ID=87717726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211613205.8A Pending CN116662819A (en) 2022-12-15 2022-12-15 Short text-oriented matching method and system

Country Status (1)

Country Link
CN (1) CN116662819A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744656A (en) * 2023-12-21 2024-03-22 湖南工商大学 Named entity identification method and system combining small sample learning and self-checking


Similar Documents

Publication Publication Date Title
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN107122413B (en) Keyword extraction method and device based on graph model
CN107451126B (en) Method and system for screening similar meaning words
CN109165291B (en) Text matching method and electronic equipment
CN106970910B (en) Keyword extraction method and device based on graph model
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110705296A (en) Chinese natural language processing tool system based on machine learning and deep learning
CN109002473B (en) Emotion analysis method based on word vectors and parts of speech
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN112883175B (en) Meteorological service interaction method and system combining pre-training model and template generation
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN114428850A (en) Text retrieval matching method and system
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN116662819A (en) Short text-oriented matching method and system
CN114239599A (en) Method, system, equipment and medium for realizing machine reading understanding
Park et al. Automatic analysis of thematic structure in written English
Ye et al. A sentiment based non-factoid question-answering framework
CN113590768B (en) Training method and device for text relevance model, question answering method and device
CN114417880A (en) Interactive intelligent question-answering method based on power grid practical training question-answering knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination