CN112257453B - Chinese-Yue text similarity calculation method fusing keywords and semantic features - Google Patents


Info

Publication number
CN112257453B
CN112257453B (application CN202011006911.7A)
Authority
CN
China
Prior art keywords
text
keywords
chinese
similarity
vietnamese
Prior art date
Legal status
Active
Application number
CN202011006911.7A
Other languages
Chinese (zh)
Other versions
CN112257453A (en
Inventor
高盛祥
潘润海
余正涛
毛存礼
朱俊国
王振晗
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202011006911.7A priority Critical patent/CN112257453B/en
Publication of CN112257453A publication Critical patent/CN112257453A/en
Application granted granted Critical
Publication of CN112257453B publication Critical patent/CN112257453B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis


Abstract

The invention relates to a Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features, and belongs to the technical field of natural language processing. The invention comprises the following steps: extracting keywords from the Chinese and Vietnamese articles, translating the Vietnamese keywords into Chinese, and computing the keywords co-occurring in the two articles to obtain word-level similarity information; extracting the closely related sentences using the co-occurring keywords and splicing them to represent the text, removing irrelevant sentences to compress it; then training a Chinese-Vietnamese BERT model by knowledge distillation to encode the compressed text and obtain context semantic features; and finally, fusing the word-level similarity information and the context semantic features to judge text relevance. The method improves the accuracy of Chinese-Vietnamese text similarity calculation.

Description

Chinese-Yue text similarity calculation method fusing keywords and semantic features
Technical Field
The invention relates to a Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features, and belongs to the technical field of natural language processing.
Background
Chinese-Vietnamese text similarity calculation plays an important supporting role in Chinese-Vietnamese cross-language information retrieval, multilingual document clustering, machine translation, bilingual corpus construction, and the like. At present, given the scarcity of text-level training corpora and the poor quality of Chinese-Vietnamese machine translation, Chinese-Vietnamese text similarity calculation faces many difficulties. It is therefore necessary to provide a text similarity calculation method suited to scarce Chinese-Vietnamese corpora and poor translation quality.
Recently, with the development of feature extractors such as LSTM and Transformer, sentence-level feature extraction has become quite effective. For whole texts, however, Chinese-Vietnamese documents often contain a large amount of redundant information, and the key information does not run through the whole article, so capturing key context information with a neural network is difficult; meanwhile, Chinese and Vietnamese are not aligned in the neural network's vector space. Some scholars have therefore considered methods based on translation, counts of inter-translated word pairs, vector space models, LDA topic models, and the like to solve similarity calculation at the text level.
Disclosure of Invention
The invention provides a Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features, which addresses the poor performance of translation-based similarity calculation and the insufficient capture of text information by neural networks.
The technical scheme of the invention is as follows: the method for calculating the similarity of Chinese-Vietnamese text fusing keywords and semantic features comprises the following steps:
Step1, preprocessing the Chinese-Vietnamese text corpus data and splitting the text into a word sequence and a sentence sequence;
Step2, taking the word sequence as the input of a keyword acquisition layer, processing Vietnamese and Chinese differently to obtain the co-occurring keyword information between texts, and calculating the keyword-based text similarity information;
Step3, taking the sentence sequence as the input of a text compression layer, removing sentences unrelated to the co-occurrence keywords to compress the text, splicing the sentences that contain co-occurrence keywords, inputting the resulting Chinese and Vietnamese short texts into the Chinese-Vietnamese BERT model to capture the context semantic features of the text, and calculating the sentence-based semantic feature similarity;
and Step4, fusing the similar information based on the keywords and the semantic features based on the sentences to obtain the similar information of the final text.
As a further aspect of the present invention, Step1 is:
Step1, first perform word segmentation and stop-word removal on the Chinese-Vietnamese parallel text corpus data, and take the word sequence and sentence sequence of each text as the input of the downstream model.
As a further scheme of the present invention, the Step1 specifically comprises the following steps:
Step1.1, preprocessing the Chinese-Vietnamese parallel text corpus data; the input is a Chinese document and a Vietnamese document, which are respectively split into word sequences $W_C=(C_1,C_2,\ldots,C_n)$ and $W_V=(V_1,V_2,\ldots,V_n)$ and sentence sequences $S_C=(S_{c1},S_{c2},\ldots,S_{cn})$ and $S_V=(S_{v1},S_{v2},\ldots,S_{vn})$. The word sequences serve as the input of the keyword acquisition layer and are processed to obtain word-level similarity information; the sentence sequences serve as the input of the text compression layer and are processed to obtain the contextual semantic similarity features of the texts.
As a further scheme of the present invention, the Step2 specifically comprises the following steps:
Step2.1, extract the keywords of each document with the keyword extraction algorithm TextRank and obtain each keyword's weight. The algorithm represents the relations among words as a directed weighted graph $G=(V,E)$, where $V$ is the vertex set and $E$ is the edge set; the weight is computed as:

$$WS(V_i) = (1-d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where $d$ is the damping coefficient, $WS(V_i)$ and $WS(V_j)$ are the weights of words $V_i$ and $V_j$, $In(V_i)$ is the set of vertices pointing to $V_i$, $Out(V_j)$ is the set of vertices that $V_j$ points to, and $w_{ji}$ and $w_{jk}$ are the edge weights between $V_j, V_i$ and $V_j, V_k$, respectively;
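As a concrete illustration of the weight formula above, here is a minimal pure-Python TextRank sketch over an unweighted word co-occurrence graph (an undirected simplification: every edge weight $w_{ji}$ is 1, so $In$ and $Out$ both reduce to the neighbor set; function and variable names are illustrative, not from the patent):

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iters=50):
    """Minimal TextRank: score words by iterating the weight formula
    over a co-occurrence graph (words within `window` of each other)."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:  # no self-loops
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    ws = {w: 1.0 for w in neighbors}  # initial weights
    for _ in range(iters):
        # WS(V_i) = (1 - d) + d * sum over neighbors of WS(V_j) / deg(V_j)
        ws = {w: (1 - d) + d * sum(ws[u] / len(neighbors[u])
                                   for u in neighbors[w])
              for w in neighbors}
    return ws
```

On a fully connected co-occurrence graph every word converges to weight 1.0; in real use the input would be a segmented, stop-word-filtered Chinese or Vietnamese word sequence.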
Step2.2, after the keyword information of each text is output by the TextRank algorithm, the Vietnamese text keywords are translated into Chinese by a translation module; for both the translated keywords and the Chinese keywords, a set of near-synonyms is computed with the Chinese synonym tool Synonyms, and the near-synonyms are merged with each article's keywords to form the Chinese text keyword set and the Vietnamese text keyword set;
Step2.3, to obtain the keyword similarity features of the two documents, the co-occurring keywords of the two articles are found from the Chinese text keyword set and the Vietnamese text keyword set, and the keyword similarity of the two articles is obtained as the share of the co-occurring keywords' weights in the weights of all extracted keywords.
As a further aspect of the present invention, the specific method in Step2.3 for determining the keyword similarity of the two articles from the share of the co-occurring keywords' weights among all extracted keywords is as follows: the proportion of the weights of keywords co-occurring in both articles among the weights of all extracted keywords serves as the keyword-based text similarity, computed as:

$$Sim_1 = \frac{\sum_{i=1}^{m} WI_i}{\sum_{i=1}^{n} WC_i}$$

where $WI_i$ is the weight of the $i$-th co-occurring keyword, $WC_i$ is the weight of the $i$-th extracted keyword, $n$ is the number of extracted keywords, and $m$ is the number of co-occurring keywords.
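A minimal sketch of this keyword-share computation, assuming keyword-to-weight dicts for the Chinese article and for the already-translated Vietnamese article (names are illustrative; how the two articles' weights are combined, summed here, is an assumption, since the source only specifies the ratio of co-occurring to total keyword weight):

```python
def sim1(zh_keywords, vi_keywords):
    """Keyword-based similarity Sim_1: summed weight of keywords
    co-occurring in both articles over the summed weight of all
    extracted keywords (both dicts map keyword -> TextRank weight)."""
    co = set(zh_keywords) & set(vi_keywords)  # co-occurring keywords
    wi = sum(zh_keywords[k] + vi_keywords[k] for k in co)
    wc = sum(zh_keywords.values()) + sum(vi_keywords.values())
    return wi / wc if wc else 0.0
```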
As a further aspect of the present invention, Step3 is:
in order to map the sentences or short text paragraphs of Hanyue to a dense vector space, a Hanyue BERT model (ZH-VI BERT) capable of capturing the upper and lower semantic information of Hanyue is trained, and the existing sentence embedding model is expanded to a new language by adopting a knowledge distillation method; source language s is mapped to a dense vector space using teacher model M, with training data being pairs of chinese-crossing parallel sentences ((s)1,t1),...,(sn,tn) Wherein s) isiIs the source language, tiTraining new student models for target languages
Figure BDA0002696261060000032
Make it
Figure BDA0002696261060000033
And
Figure BDA0002696261060000034
this method is called multi-language knowledge distillation learning because students
Figure BDA0002696261060000035
The knowledge of teacher M is refined, the minimum batch B is given, the mean square loss MSE of the minimum batch B is minimized, and the calculation formula is shown as follows.
Figure BDA0002696261060000036
Student model
Figure BDA0002696261060000037
The system can be provided with the structure and the weight of a teacher model M, and can also be provided with other network system structures with completely different weights, a Chinese BERT model is used as the teacher model, and a student model is a multi-language BERT model;
and the compressed text is input into the trained Chinese-Vietnamese BERT model for semantic capture to obtain the context semantic features.
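The distillation objective can be sketched in pure Python as follows (in practice this loss would drive gradient updates in a deep-learning framework; here the embeddings are plain nested lists and all names are illustrative):

```python
def distill_loss(teacher_src, student_src, student_tgt):
    """Multilingual knowledge-distillation MSE over a mini-batch:
    the teacher embedding M(s_j) of each source sentence supervises
    the student's embeddings of both the source sentence and its
    target-language translation."""
    total, count = 0.0, 0
    for m_s, mh_s, mh_t in zip(teacher_src, student_src, student_tgt):
        for a, b, c in zip(m_s, mh_s, mh_t):
            total += (a - b) ** 2 + (a - c) ** 2  # both MSE terms
            count += 1
    return total / count
```

Driving the loss to zero forces the student to place a sentence and its translation at the teacher's embedding of the source sentence, which is what aligns the two languages in one vector space.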
As a further aspect of the present invention, Step3 captures context semantic features of a text, and calculates semantic feature similarity based on sentences as follows:
The Chinese short text and the Vietnamese short text associated with each co-occurrence keyword are input into the Chinese-Vietnamese BERT model and encoded; the cosine similarity of the two output feature vectors is then computed as:

$$F(S_1, S_2) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}$$
where $a_i$ is the $i$-th component of the vector of the Chinese short text $S_1$, and $b_i$ is the $i$-th component of the vector of the Vietnamese short text $S_2$;
after the co-occurrence-keyword-based similarity of each Chinese-Vietnamese short-text pair is obtained, the values are averaged to give the context-semantic similarity $Sim_2$:

$$Sim_2 = \frac{1}{m} \sum_{i=1}^{m} F_i$$
where $F_i$ is the context semantic similarity for the $i$-th co-occurrence keyword.
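The cosine similarity and the averaging into $Sim_2$ can be sketched as follows (pure Python; in the patent the vectors would come from the Chinese-Vietnamese BERT encoder, and function names here are illustrative):

```python
import math

def cosine(a, b):
    """F(S1, S2): cosine similarity of two encoded text vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def sim2(encoded_pairs):
    """Sim_2: average cosine similarity over the m short-text pairs,
    one (Chinese vector, Vietnamese vector) pair per co-occurrence
    keyword."""
    return sum(cosine(a, b) for a, b in encoded_pairs) / len(encoded_pairs)
```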
The invention has the beneficial effects that: the method provided by the invention solves the problems that the similarity calculation effect is poor by using a translation method and the text information is not sufficiently captured by a neural network, and improves the accuracy of Chinese-Yuan text similarity calculation.
Drawings
FIG. 1 is a general model architecture diagram of the present invention;
FIG. 2 is a training diagram of the Chinese-Vietnamese BERT model of the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-2, a Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features comprises the following steps:
Step1, preprocessing the Chinese-Vietnamese parallel text corpus; the input is a Chinese document and a Vietnamese document, which are respectively split into word sequences $W_C=(C_1,C_2,\ldots,C_n)$ and $W_V=(V_1,V_2,\ldots,V_n)$ and sentence sequences $S_C=(S_{c1},S_{c2},\ldots,S_{cn})$ and $S_V=(S_{v1},S_{v2},\ldots,S_{vn})$; the word sequences, as input of the keyword acquisition layer, are processed to obtain word-level similarity information, and the sentence sequences, as input of the text compression layer, are processed to obtain the contextual semantic similarity features of the texts;
Step2, taking the word sequence as the input of the keyword acquisition layer, processing Vietnamese and Chinese differently to obtain the co-occurring keyword information between texts, and calculating the keyword-based text similarity information;
Step2.1, extract the keywords of each document with the keyword extraction algorithm TextRank and obtain each keyword's weight. The algorithm represents the relations among words as a directed weighted graph $G=(V,E)$, where $V$ is the vertex set and $E$ is the edge set; the weight is computed as:

$$WS(V_i) = (1-d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where $d$ is the damping coefficient, $WS(V_i)$ and $WS(V_j)$ are the weights of words $V_i$ and $V_j$, $In(V_i)$ is the set of vertices pointing to $V_i$, $Out(V_j)$ is the set of vertices that $V_j$ points to, and $w_{ji}$ and $w_{jk}$ are the edge weights between $V_j, V_i$ and $V_j, V_k$, respectively;
Step2.2, because Chinese and Vietnamese texts differ somewhat in their choice of terms, near-synonyms of the keywords are computed to reduce the terminology gap between the cross-language documents. After the keyword information of each text is output by the TextRank algorithm, the Vietnamese keywords are translated into Chinese by a translation module; for both the translated keywords and the Chinese keywords, a set of near-synonyms is computed with the Chinese synonym tool Synonyms, and the near-synonyms are merged with each article's keywords to form the Chinese text keyword set and the Vietnamese text keyword set;
Step2.3, to obtain the keyword similarity features of the two documents, the co-occurring keywords of the two articles are found from the obtained Chinese text keyword set and Vietnamese text keyword set, and the keyword similarity of the two articles is obtained as the share of the co-occurring keywords' weights in the weights of all extracted keywords. Specifically, the proportion of the weights of keywords co-occurring in both articles among the weights of all extracted keywords serves as the keyword-based text similarity, computed as:

$$Sim_1 = \frac{\sum_{i=1}^{m} WI_i}{\sum_{i=1}^{n} WC_i}$$

where $WI_i$ is the weight of the $i$-th co-occurring keyword, $WC_i$ is the weight of the $i$-th extracted keyword, $n$ is the number of extracted keywords, and $m$ is the number of co-occurring keywords.
Step3, taking the sentence sequence as the input of the text compression layer, removing sentences unrelated to the co-occurrence keywords to compress the text, splicing the sentences that contain co-occurrence keywords, inputting the resulting Chinese and Vietnamese short texts into the Chinese-Vietnamese BERT model to capture the context semantic features of the text, and calculating the sentence-based semantic feature similarity;
To map Chinese and Vietnamese sentences or short text paragraphs into a shared dense vector space, a Chinese-Vietnamese BERT model (ZH-VI BERT) capable of capturing Chinese-Vietnamese contextual semantics is trained, extending an existing sentence-embedding model to a new language by knowledge distillation. A teacher model $M$ maps the source language $s$ to a dense vector space; the training data are Chinese-Vietnamese parallel sentence pairs $((s_1,t_1),\ldots,(s_n,t_n))$, where $s_i$ is a source-language sentence and $t_i$ is its target-language translation. A new student model $\hat{M}$ is trained such that $\hat{M}(s_i) \approx M(s_i)$ and $\hat{M}(t_i) \approx M(s_i)$. Because the student $\hat{M}$ distills the knowledge of the teacher $M$, this method is called multilingual knowledge distillation. Given a mini-batch $B$, the mean squared error (MSE) is minimized:

$$\min_{\hat{M}} \frac{1}{|B|} \sum_{j \in B} \left[ \left( M(s_j) - \hat{M}(s_j) \right)^2 + \left( M(s_j) - \hat{M}(t_j) \right)^2 \right]$$

The student model $\hat{M}$ may adopt the structure and weights of the teacher model $M$, or an entirely different network architecture with different weights; the training process is shown in FIG. 2. A Chinese BERT model serves as the teacher and a multilingual BERT model as the student;
and the compressed text is input into the trained Chinese-Vietnamese BERT model for semantic capture to obtain the context semantic features.
The specific method for capturing the context semantic features of the text and calculating the similarity of the semantic features based on sentences is as follows:
The Chinese short text and the Vietnamese short text associated with each co-occurrence keyword are input into the Chinese-Vietnamese BERT model and encoded; the cosine similarity of the two output feature vectors is then computed as:

$$F(S_1, S_2) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}$$

where $a_i$ is the $i$-th component of the vector of the Chinese short text $S_1$, and $b_i$ is the $i$-th component of the vector of the Vietnamese short text $S_2$. After the co-occurrence-keyword-based similarity of each Chinese-Vietnamese short-text pair is obtained, the values are averaged to give the context-semantic similarity $Sim_2$:

$$Sim_2 = \frac{1}{m} \sum_{i=1}^{m} F_i$$

where $F_i$ is the context semantic similarity for the $i$-th co-occurrence keyword.
And Step4, fusing the similar information based on the keywords and the semantic features based on the sentences to obtain the similar information of the final text.
The method is specifically as follows: the average of the two computed similarity values is taken as the similarity of the two articles; the result lies between 0 and 1, where 0 means completely different and 1 means identical:

$$Sim = \frac{Sim_1 + Sim_2}{2}$$
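The Step4 fusion is then a single average (a trivial sketch of the formula above; the function name is illustrative):

```python
def final_similarity(sim1_value, sim2_value):
    """Sim = (Sim_1 + Sim_2) / 2: fuse the keyword-based and the
    context-semantic similarity into one score in [0, 1]."""
    return (sim1_value + sim2_value) / 2
```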
In fig. 1, the model of the present invention comprises the following:
A data preprocessing layer: first, the Chinese-Vietnamese text data is preprocessed according to the characteristics of the text and the nature of the neural network, so that the data meets the requirements of the model.
A keyword acquisition layer: obtains word-level similarity information between texts. Because a Chinese-Vietnamese text often contains a large amount of redundant information and the key information does not run through the whole article, it is difficult to capture key context information with a neural network. The Chinese-Vietnamese text similarity task is therefore converted into a similarity task over keywords and key sentences: keywords are extracted from the text, and sentences expressing the core semantics of the article are used to realize the similarity calculation.
A statistical characteristic acquisition layer: in order to obtain the keyword similarity characteristics of two documents, the Chinese text keyword word set and the Vietnamese text keyword word set obtained by a keyword obtaining layer are used for obtaining co-occurrence keywords of the two articles, and the weight of the co-occurrence keywords accounts for the weight of all extracted keywords to obtain the similarity of the keywords of the two articles.
Text compression layer: since a text contains much redundant information, a text compression method based on co-occurrence keywords is proposed to extract the key information of an article. The text is compressed using the co-occurrence keywords extracted by the statistical feature acquisition layer: sentences associated with them are retained and irrelevant sentences removed. If a keyword appears in several sentences, the Chinese sentences and the Vietnamese sentences for that keyword are spliced separately into two short texts, which serve as input to the context feature acquisition layer.
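A sketch of this co-occurrence-keyword text compression, assuming sentences are plain strings and a keyword "appears" in a sentence via substring match after word segmentation (all names are illustrative):

```python
def compress(sentences, co_keywords):
    """Keep only sentences containing a co-occurrence keyword; splice
    the hits for each keyword into one short text, dropping the rest."""
    shorts = {}
    for kw in co_keywords:
        hits = [s for s in sentences if kw in s]
        if hits:
            shorts[kw] = " ".join(hits)  # spliced short text for kw
    return shorts
```

Applied separately to the Chinese and the Vietnamese sentence sequences, this yields, per keyword, the two short texts fed to the context feature acquisition layer.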
Context feature acquisition layer: to capture contextual features, the currently most effective feature encoder, BERT, is used to encode the Chinese-Vietnamese sentences and obtain contextual semantic features.
Prediction layer: from the two kinds of similarity information produced by the statistical feature layer and the context feature acquisition layer, similarity information in two different dimensions, keywords and sentences, is obtained and fused.
To train the Chinese-Vietnamese BERT model for semantic feature extraction of Chinese and Vietnamese, 500,000 Chinese-Vietnamese parallel sentence pairs were constructed. To verify the validity of the proposed method, the algorithms proposed and used herein were tested: text-level aligned Chinese-Vietnamese news data and story datasets were collected from the web, and 400 and 800 document pairs respectively form the standard set used to verify the validity of the algorithm.
The effectiveness of the algorithm is measured by its text-matching accuracy, i.e., the proportion of correctly matched texts among all texts.
When training the Chinese-Vietnamese BERT model, the activation function is GELU, the hidden layer dimension is 768, the number of attention heads is 12, the number of hidden layers is 12, the hidden-layer dropout probability is 0.1, the learning rate is 2e-5 with the Adam optimizer, the batch size is 32, the number of epochs is 20, and the dictionary is that of the multilingual BERT model, of size 119547.
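The hyperparameters above can be collected in one place (a sketch; the key names follow Hugging Face `BertConfig` conventions and are assumptions, not taken from the patent):

```python
# ZH-VI BERT training settings as stated in the text; key names assumed.
zh_vi_bert_config = {
    "hidden_act": "gelu",
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "hidden_dropout_prob": 0.1,
    "learning_rate": 2e-5,
    "optimizer": "Adam",
    "batch_size": 32,
    "num_epochs": 20,
    "vocab_size": 119547,  # multilingual BERT dictionary size
}
```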
To verify the effectiveness of the method proposed herein, comparative experiments with some existing cross-language similarity calculation methods were set up to compare the method with results obtained from LDA topic models, full-text translation and BiRNN. The similarity calculation method for full-text translation translates Vietnamese into Chinese, represents sentences of two texts through BERT, and calculates the distance between the sentences to obtain the similarity of the texts.
The accuracy of the proposed model is compared with some existing models in Table 1. Clearly, the proposed method outperforms the topic model, showing that text compression and BERT-based context semantic encoding mine text similarity information better; it also improves on the full-text translation method, because a translation system introduces errors when translating Chinese-Vietnamese text, and translating only the extracted keywords effectively reduces these errors compared with full-text translation; compared with BiRNN, the accuracy improves by 3.5 percent. The proposed method therefore improves on several existing similarity calculation models.
The Chinese-Vietnamese text similarity calculation fusing keywords and semantic features performs well on the Chinese-Vietnamese text similarity task, mainly for the following reasons: 1. the currently most mainstream feature extractor, BERT, is used, improving sentence-level semantic capture; 2. a similarity method fusing co-occurring keyword information and context semantic features between texts is proposed, mining similarity information from different dimensions; 3. the method alleviates text information redundancy, so the neural network captures text information more effectively.
TABLE 1. Accuracy comparison of low-resource Chinese-Vietnamese similarity methods (the table is rendered as an image in the original patent).
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (3)

1. A Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features, characterized by comprising the following steps:
step1, preprocessing the corpus data of the Chinese-Yue text, and splitting the text into a word sequence and a sentence sequence;
step2, taking the sequence of words as the input of a keyword acquisition layer, carrying out different processing on Vietnamese and Chinese to obtain the information of co-occurring keywords between texts, and calculating the text similarity information based on the keywords;
Step3, taking the sentence sequence as the input of a text compression layer, removing sentences unrelated to the co-occurrence keywords to compress the text, splicing the sentences that contain co-occurrence keywords, inputting the resulting Chinese and Vietnamese short texts into the Chinese-Vietnamese BERT model to capture the context semantic features of the text, and calculating the sentence-based semantic feature similarity;
step4, fusing the similar information based on the keywords and the semantic features based on the sentences to obtain the similar information of the final text;
the specific steps of Step2 are as follows:
Step2.1, extract the keywords of each document with the keyword extraction algorithm TextRank and obtain each keyword's weight. The algorithm represents the relations among words as a directed weighted graph $G=(V,E)$, where $V$ is the vertex set and $E$ is the edge set; the weight is computed as:

$$WS(V_i) = (1-d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where $d$ is the damping coefficient, $WS(V_i)$ and $WS(V_j)$ are the weights of words $V_i$ and $V_j$, $In(V_i)$ is the set of vertices pointing to $V_i$, $Out(V_j)$ is the set of vertices that $V_j$ points to, and $w_{ji}$ and $w_{jk}$ are the edge weights between $V_j, V_i$ and $V_j, V_k$, respectively;
Step2.2, after the keyword information of each text is output by the TextRank algorithm, the Vietnamese text keywords are translated into Chinese by a translation module; for both the translated keywords and the Chinese keywords, a set of near-synonyms is computed with the Chinese synonym tool Synonyms, and the near-synonyms are merged with each article's keywords to form the Chinese text keyword set and the Vietnamese text keyword set;
Step2.3, to obtain the keyword similarity features of the two documents, the co-occurring keywords of the two articles are found from the obtained Chinese text keyword set and Vietnamese text keyword set, and the keyword similarity of the two articles is obtained as the share of the co-occurring keywords' weights in the weights of all extracted keywords;
in Step2.3, the specific method for determining the keyword similarity of the two articles from the share of the co-occurring keywords' weights among all extracted keywords is as follows: the proportion of the weights of keywords co-occurring in both articles among the weights of all extracted keywords serves as the keyword-based text similarity, computed as:

$$Sim_1 = \frac{\sum_{i=1}^{m} WI_i}{\sum_{i=1}^{n} WC_i}$$

where $WI_i$ is the weight of the $i$-th co-occurring keyword, $WC_i$ is the weight of the $i$-th extracted keyword, $n$ is the number of extracted keywords, and $m$ is the number of co-occurring keywords;
in Step 3:
to map Chinese and Vietnamese sentences or short text paragraphs into a shared dense vector space, a Chinese-Vietnamese BERT model capable of capturing Chinese-Vietnamese contextual semantics is trained, extending an existing sentence-embedding model to a new language by knowledge distillation; a teacher model $M$ maps the source language $s$ to a dense vector space, the training data are Chinese-Vietnamese parallel sentence pairs $((s_1,t_1),\ldots,(s_n,t_n))$, where $s_i$ is a source-language sentence and $t_i$ is its target-language translation, and a new student model $\hat{M}$ is trained such that $\hat{M}(s_i) \approx M(s_i)$ and $\hat{M}(t_i) \approx M(s_i)$; because the student $\hat{M}$ distills the knowledge of the teacher $M$, this method is called multilingual knowledge distillation; given a mini-batch $B$, the mean squared error (MSE) is minimized:

$$\min_{\hat{M}} \frac{1}{|B|} \sum_{j \in B} \left[ \left( M(s_j) - \hat{M}(s_j) \right)^2 + \left( M(s_j) - \hat{M}(t_j) \right)^2 \right]$$

the student model $\hat{M}$ may adopt the structure and weights of the teacher model $M$, or an entirely different network architecture with different weights; a Chinese BERT model serves as the teacher and a multilingual BERT model as the student;
the trained model is called a Chinese-more BERT model, and the compressed text is input into the trained Chinese-more BERT model and subjected to semantic capture to obtain context semantic features;
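The distillation objective above can be sketched numerically as follows. This is an illustration of the loss only, assuming the teacher and student are plain embedding functions; in the described method they would be a Chinese BERT (teacher) and a multilingual BERT (student), trained by gradient descent rather than evaluated as stubs.

```python
# Sketch of the multilingual knowledge-distillation loss: the student must
# match the teacher's embedding of the Chinese source sentence for BOTH the
# source sentence and its Vietnamese translation.
import numpy as np

def distillation_mse(teacher, student, batch):
    """batch: list of (source_sentence, target_sentence) pairs."""
    losses = []
    for s, t in batch:
        ts = teacher(s)                                  # teacher embedding of source
        losses.append(np.mean((ts - student(s)) ** 2))   # student on source
        losses.append(np.mean((ts - student(t)) ** 2))   # student on target
    return float(np.mean(losses))

# Stub models for illustration: teacher embeds everything as ones, student as zeros.
loss = distillation_mse(lambda s: np.ones(4), lambda s: np.zeros(4),
                        [("你好", "xin chào")])  # 1.0
```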
in Step3, the context semantic features of the text are captured, and the sentence-based semantic feature similarity is calculated as follows:
the Chinese short texts and the Vietnamese short texts related to the co-occurrence keywords are respectively input into the Chinese-Vietnamese BERT model to be encoded, and the cosine distance between the two encoded output feature vectors is calculated using cosine similarity, the calculation formula being as follows:

$$\cos(S_{1},S_{2})=\frac{\sum_{i=1}^{n} a_{i} b_{i}}{\sqrt{\sum_{i=1}^{n} a_{i}^{2}}\sqrt{\sum_{i=1}^{n} b_{i}^{2}}}$$

wherein a_i represents the i-th feature value of the vector of the Chinese short text S1, and b_i represents the i-th feature value of the vector of the Vietnamese short text S2;
after the Chinese-Vietnamese short-text similarity information based on each co-occurrence keyword is obtained, the values are averaged to obtain the similarity Sim2 based on context semantic information, the calculation formula being as follows:

$$Sim_{2}=\frac{1}{m}\sum_{i=1}^{m} F_{i}$$

wherein F_i represents the context semantic similarity of the i-th co-occurrence keyword and m is the number of co-occurrence keywords.
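The two formulas above (per-keyword cosine similarity, then averaging into Sim2) can be sketched as follows. The vectors are illustrative stand-ins for Chinese-Vietnamese BERT encodings, not real model outputs.

```python
# Sketch of Step 3 scoring: cosine similarity between the Chinese and
# Vietnamese short-text vectors for each co-occurrence keyword, averaged
# into the context-semantic similarity Sim2.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def sim2(pairs):
    """pairs: list of (zh_vector, vi_vector), one per co-occurrence keyword."""
    scores = [cosine(a, b) for a, b in pairs]
    return sum(scores) / len(scores) if scores else 0.0

# Two hypothetical keyword-aligned text pairs: identical vs. orthogonal vectors.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
score = sim2(pairs)  # (1.0 + 0.0) / 2 = 0.5
```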
2. The method for calculating similarity of Chinese-Yue text fusing keywords and semantic features according to claim 1, characterized in that: in Step 1:
step1, word segmentation and stop-word removal are first performed on the Chinese-Vietnamese parallel text corpus data, and the word sequences and sentence sequences of the text are taken as the input of the downstream models.
3. The method for calculating similarity of Chinese-Yue text fusing keywords and semantic features according to claim 1, characterized in that the specific steps of Step1 are as follows:
step1.1, the Chinese-Vietnamese parallel text corpus data are preprocessed, the input being a Chinese document and a Vietnamese document; the Chinese document and the Vietnamese document are respectively split into word sequences WC=(C1,C2,…,Cn), WV=(V1,V2,…,Vn) and sentence sequences SC=(Sc1,Sc2,…,Scn), SV=(Sv1,Sv2,…,Svn); the word sequences are used as the input of the keyword acquisition layer and are processed to obtain word-based similarity information of the texts, and the sentence sequences are used as the input of the text compression layer and are processed to obtain the context semantic similarity features of the texts.
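The splitting in step1.1 can be sketched as follows. This is a simplified stand-in: word segmentation is stubbed with whitespace splitting and a tiny stop-word list, whereas the described method would use proper Chinese and Vietnamese segmenters; the sample document is hypothetical.

```python
# Sketch of step1.1 preprocessing: split a document into a sentence sequence
# and a stop-word-filtered word sequence for the downstream layers.
import re

STOP_WORDS = {"的", "了", "是"}   # illustrative stop-word list

def preprocess(document):
    sentences = [s for s in re.split(r"[。！？.!?]", document) if s.strip()]
    words = [w for s in sentences for w in s.split() if w not in STOP_WORDS]
    return words, sentences

words, sents = preprocess("地震 发生 了 在 城市。 救援 队 到达。")
# sents has 2 sentences; words excludes the stop word "了"
```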
CN202011006911.7A 2020-09-23 2020-09-23 Chinese-Yue text similarity calculation method fusing keywords and semantic features Active CN112257453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011006911.7A CN112257453B (en) 2020-09-23 2020-09-23 Chinese-Yue text similarity calculation method fusing keywords and semantic features

Publications (2)

Publication Number Publication Date
CN112257453A CN112257453A (en) 2021-01-22
CN112257453B true CN112257453B (en) 2022-02-22

Family

ID=74231459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011006911.7A Active CN112257453B (en) 2020-09-23 2020-09-23 Chinese-Yue text similarity calculation method fusing keywords and semantic features

Country Status (1)

Country Link
CN (1) CN112257453B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076398B (en) * 2021-03-30 2022-07-29 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance
CN113011194B (en) * 2021-04-15 2022-05-03 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113469977B (en) * 2021-07-06 2024-01-12 浙江霖研精密科技有限公司 Flaw detection device, method and storage medium based on distillation learning mechanism
CN113657125B (en) * 2021-07-14 2023-05-26 内蒙古工业大学 Mongolian non-autoregressive machine translation method based on knowledge graph
CN113901840B (en) * 2021-09-15 2024-04-19 昆明理工大学 Text generation evaluation method based on multi-granularity characteristics
CN114595688B (en) * 2022-01-06 2023-03-10 昆明理工大学 Chinese cross-language word embedding method fusing word cluster constraint
CN114528276B (en) * 2022-02-21 2024-01-19 新疆能源翱翔星云科技有限公司 Big data acquisition, storage and management system and method based on artificial intelligence
CN114707516A (en) * 2022-03-29 2022-07-05 北京理工大学 Long text semantic similarity calculation method based on contrast learning
CN115146629A (en) * 2022-05-10 2022-10-04 昆明理工大学 News text and comment correlation analysis method based on comparative learning
CN114912449B (en) * 2022-07-18 2022-09-30 山东大学 Technical feature keyword extraction method and system based on code description text
CN116680420B (en) * 2023-08-02 2023-10-13 昆明理工大学 Low-resource cross-language text retrieval method and device based on knowledge representation enhancement

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750687B (en) * 2013-12-25 2018-03-20 株式会社东芝 Improve method and device, machine translation method and the device of bilingualism corpora
CN108304390B (en) * 2017-12-15 2020-10-16 腾讯科技(深圳)有限公司 Translation model-based training method, training device, translation method and storage medium
CN109145289A (en) * 2018-07-19 2019-01-04 昆明理工大学 Based on the old-Chinese bilingual sentence similarity calculating method for improving relation vector model
CN109325229B (en) * 2018-09-19 2023-01-31 中译语通科技股份有限公司 Method for calculating text similarity by utilizing semantic information
CN110377918B (en) * 2019-07-15 2020-08-28 昆明理工大学 Chinese-transcendental neural machine translation method fused with syntactic parse tree
CN111581943A (en) * 2020-04-02 2020-08-25 昆明理工大学 Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph

Also Published As

Publication number Publication date
CN112257453A (en) 2021-01-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant