CN112257453B - Chinese-Yue text similarity calculation method fusing keywords and semantic features - Google Patents


Info

Publication number
CN112257453B
CN112257453B (application CN202011006911.7A)
Authority
CN
China
Prior art keywords
text
keywords
chinese
similarity
vietnamese
Prior art date
Legal status
Active
Application number
CN202011006911.7A
Other languages
Chinese (zh)
Other versions
CN112257453A (en
Inventor
高盛祥
潘润海
余正涛
毛存礼
朱俊国
王振晗
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202011006911.7A priority Critical patent/CN112257453B/en
Publication of CN112257453A publication Critical patent/CN112257453A/en
Application granted granted Critical
Publication of CN112257453B publication Critical patent/CN112257453B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis


Abstract

The invention relates to a Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features, and belongs to the technical field of natural language processing. The invention comprises the following steps: extracting keywords from the Chinese and Vietnamese articles, translating the Vietnamese keywords into Chinese, and computing the keywords co-occurring in the two articles to obtain word-level similarity information; extracting the closely related sentences using the co-occurring keywords and splicing them to represent the text, removing irrelevant sentences to compress it; then training a Chinese-Vietnamese BERT model by knowledge distillation to encode the compressed text and obtain context semantic features; and finally, fusing the word-level similarity information and the context semantic features to judge text relevance. The method improves the accuracy of Chinese-Vietnamese text similarity calculation.

Description

Chinese-Yue text similarity calculation method fusing keywords and semantic features
Technical Field
The invention relates to a Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features, and belongs to the technical field of natural language processing.
Background
Chinese-Vietnamese text similarity calculation plays an important supporting role in Chinese-Vietnamese cross-language information retrieval, multilingual document clustering, machine translation, bilingual corpus construction, and the like. At present, given the scarcity of text-level training corpora and the poor quality of Chinese-Vietnamese machine translation, Chinese-Vietnamese text similarity calculation faces many difficulties. It is therefore necessary to provide a text similarity calculation method suited to scarce Chinese-Vietnamese corpora and poor translation quality.
Recently, with the development of feature extractors such as LSTM and Transformer, sentence-level feature extraction has become quite effective. For whole texts, however, Chinese-Vietnamese documents often contain a large amount of redundant information, and the key information does not run through the whole article, so capturing key context information with a neural network is difficult; meanwhile, Chinese and Vietnamese are not aligned in the neural network's vector space. Some scholars have therefore considered methods based on translation, counts of inter-translated word pairs, vector space models, LDA topic models, and the like to solve similarity calculation at the text level.
Disclosure of Invention
The invention provides a Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features, which addresses the poor performance of translation-based similarity calculation and the insufficient capture of text information by neural networks.
The technical scheme of the invention is as follows: the method for calculating the similarity of Chinese-Vietnamese text fusing keywords and semantic features comprises the following steps:
Step1, preprocessing the Chinese-Vietnamese text corpus data and splitting the text into a word sequence and a sentence sequence;
Step2, taking the word sequence as the input of a keyword acquisition layer, processing Vietnamese and Chinese differently to obtain the co-occurring keyword information between texts, and calculating the keyword-based text similarity information;
Step3, taking the sentence sequence as the input of a text compression layer, removing sentences unrelated to the co-occurrence keywords to compress the text, splicing the sentences that contain co-occurrence keywords, inputting the resulting Chinese and Vietnamese short texts into the Chinese-Vietnamese BERT model to capture the context semantic features of the text, and calculating the sentence-based semantic feature similarity;
and Step4, fusing the similar information based on the keywords and the semantic features based on the sentences to obtain the similar information of the final text.
As a further aspect of the present invention, Step1 is:
Step1, first perform word segmentation and stop-word removal on the Chinese-Vietnamese parallel text corpus data, and take the word sequence and sentence sequence of each text as the input of the downstream model.
As a further scheme of the present invention, the Step1 specifically comprises the following steps:
Step1.1, preprocessing the Chinese-Vietnamese parallel text corpus data; the input is a Chinese document and a Vietnamese document, which are respectively split into word sequences $W_C=(C_1,C_2,\ldots,C_n)$ and $W_V=(V_1,V_2,\ldots,V_n)$ and sentence sequences $S_C=(S_{c1},S_{c2},\ldots,S_{cn})$ and $S_V=(S_{v1},S_{v2},\ldots,S_{vn})$. The word sequences serve as the input of the keyword acquisition layer and are processed to obtain word-level similarity information; the sentence sequences serve as the input of the text compression layer and are processed to obtain the contextual semantic similarity features of the texts.
As a further scheme of the present invention, the Step2 specifically comprises the following steps:
Step2.1, extract the keywords of each document with the keyword extraction algorithm TextRank and obtain each keyword's weight. The algorithm represents the relations among words as a directed weighted graph $G=(V,E)$, where $V$ is the vertex set and $E$ is the edge set; the weight is computed as:

$$WS(V_i) = (1-d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where $d$ is the damping coefficient, $WS(V_i)$ and $WS(V_j)$ are the weights of words $V_i$ and $V_j$, $In(V_i)$ is the set of vertices pointing to $V_i$, $Out(V_j)$ is the set of vertices that $V_j$ points to, and $w_{ji}$ and $w_{jk}$ are the edge weights between $V_j, V_i$ and $V_j, V_k$, respectively;
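As a concrete illustration of the weight formula above, here is a minimal pure-Python TextRank sketch over an unweighted word co-occurrence graph (an undirected simplification: every edge weight $w_{ji}$ is 1, so $In$ and $Out$ both reduce to the neighbor set; function and variable names are illustrative, not from the patent):

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iters=50):
    """Minimal TextRank: score words by iterating the weight formula
    over a co-occurrence graph (words within `window` of each other)."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:  # no self-loops
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    ws = {w: 1.0 for w in neighbors}  # initial weights
    for _ in range(iters):
        # WS(V_i) = (1 - d) + d * sum over neighbors of WS(V_j) / deg(V_j)
        ws = {w: (1 - d) + d * sum(ws[u] / len(neighbors[u])
                                   for u in neighbors[w])
              for w in neighbors}
    return ws
```

On a fully connected co-occurrence graph every word converges to weight 1.0; in real use the input would be a segmented, stop-word-filtered Chinese or Vietnamese word sequence.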
Step2.2, after the keyword information of each text is output by the TextRank algorithm, the Vietnamese text keywords are translated into Chinese by a translation module; for both the translated keywords and the Chinese keywords, a set of near-synonyms is computed with the Chinese synonym tool Synonyms, and the near-synonyms are merged with each article's keywords to form the Chinese text keyword set and the Vietnamese text keyword set;
Step2.3, to obtain the keyword similarity features of the two documents, the co-occurring keywords of the two articles are found from the Chinese text keyword set and the Vietnamese text keyword set, and the keyword similarity of the two articles is obtained as the share of the co-occurring keywords' weights in the weights of all extracted keywords.
As a further aspect of the present invention, the specific method in Step2.3 for determining the keyword similarity of the two articles from the share of the co-occurring keywords' weights among all extracted keywords is as follows: the proportion of the weights of keywords co-occurring in both articles among the weights of all extracted keywords serves as the keyword-based text similarity, computed as:

$$Sim_1 = \frac{\sum_{i=1}^{m} WI_i}{\sum_{i=1}^{n} WC_i}$$

where $WI_i$ is the weight of the $i$-th co-occurring keyword, $WC_i$ is the weight of the $i$-th extracted keyword, $n$ is the number of extracted keywords, and $m$ is the number of co-occurring keywords.
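A minimal sketch of this keyword-share computation, assuming keyword-to-weight dicts for the Chinese article and for the already-translated Vietnamese article (names are illustrative; how the two articles' weights are combined, summed here, is an assumption, since the source only specifies the ratio of co-occurring to total keyword weight):

```python
def sim1(zh_keywords, vi_keywords):
    """Keyword-based similarity Sim_1: summed weight of keywords
    co-occurring in both articles over the summed weight of all
    extracted keywords (both dicts map keyword -> TextRank weight)."""
    co = set(zh_keywords) & set(vi_keywords)  # co-occurring keywords
    wi = sum(zh_keywords[k] + vi_keywords[k] for k in co)
    wc = sum(zh_keywords.values()) + sum(vi_keywords.values())
    return wi / wc if wc else 0.0
```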
As a further aspect of the present invention, Step3 is:
in order to map the sentences or short text paragraphs of Hanyue to a dense vector space, a Hanyue BERT model (ZH-VI BERT) capable of capturing the upper and lower semantic information of Hanyue is trained, and the existing sentence embedding model is expanded to a new language by adopting a knowledge distillation method; source language s is mapped to a dense vector space using teacher model M, with training data being pairs of chinese-crossing parallel sentences ((s)1,t1),...,(sn,tn) Wherein s) isiIs the source language, tiTraining new student models for target languages
Figure BDA0002696261060000032
Make it
Figure BDA0002696261060000033
And
Figure BDA0002696261060000034
this method is called multi-language knowledge distillation learning because students
Figure BDA0002696261060000035
The knowledge of teacher M is refined, the minimum batch B is given, the mean square loss MSE of the minimum batch B is minimized, and the calculation formula is shown as follows.
Figure BDA0002696261060000036
Student model
Figure BDA0002696261060000037
The system can be provided with the structure and the weight of a teacher model M, and can also be provided with other network system structures with completely different weights, a Chinese BERT model is used as the teacher model, and a student model is a multi-language BERT model;
and the compressed text is input into the trained Chinese-Vietnamese BERT model for semantic capture to obtain the context semantic features.
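The distillation objective can be sketched in pure Python as follows (in practice this loss would drive gradient updates in a deep-learning framework; here the embeddings are plain nested lists and all names are illustrative):

```python
def distill_loss(teacher_src, student_src, student_tgt):
    """Multilingual knowledge-distillation MSE over a mini-batch:
    the teacher embedding M(s_j) of each source sentence supervises
    the student's embeddings of both the source sentence and its
    target-language translation."""
    total, count = 0.0, 0
    for m_s, mh_s, mh_t in zip(teacher_src, student_src, student_tgt):
        for a, b, c in zip(m_s, mh_s, mh_t):
            total += (a - b) ** 2 + (a - c) ** 2  # both MSE terms
            count += 1
    return total / count
```

Driving the loss to zero forces the student to place a sentence and its translation at the teacher's embedding of the source sentence, which is what aligns the two languages in one vector space.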
As a further aspect of the present invention, Step3 captures context semantic features of a text, and calculates semantic feature similarity based on sentences as follows:
The Chinese short text and the Vietnamese short text associated with each co-occurrence keyword are input into the Chinese-Vietnamese BERT model and encoded; the cosine similarity of the two output feature vectors is then computed as:

$$F(S_1, S_2) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}$$
where $a_i$ is the $i$-th component of the vector of the Chinese short text $S_1$, and $b_i$ is the $i$-th component of the vector of the Vietnamese short text $S_2$;
after the co-occurrence-keyword-based similarity of each Chinese-Vietnamese short-text pair is obtained, the values are averaged to give the context-semantic similarity $Sim_2$:

$$Sim_2 = \frac{1}{m} \sum_{i=1}^{m} F_i$$
where $F_i$ is the context semantic similarity for the $i$-th co-occurrence keyword.
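The cosine similarity and the averaging into $Sim_2$ can be sketched as follows (pure Python; in the patent the vectors would come from the Chinese-Vietnamese BERT encoder, and function names here are illustrative):

```python
import math

def cosine(a, b):
    """F(S1, S2): cosine similarity of two encoded text vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def sim2(encoded_pairs):
    """Sim_2: average cosine similarity over the m short-text pairs,
    one (Chinese vector, Vietnamese vector) pair per co-occurrence
    keyword."""
    return sum(cosine(a, b) for a, b in encoded_pairs) / len(encoded_pairs)
```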
The invention has the beneficial effects that: the method provided by the invention solves the problems that the similarity calculation effect is poor by using a translation method and the text information is not sufficiently captured by a neural network, and improves the accuracy of Chinese-Yuan text similarity calculation.
Drawings
FIG. 1 is a general model architecture diagram of the present invention;
FIG. 2 is a training diagram of the Chinese-Vietnamese BERT model of the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-2, a Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features comprises the following steps:
Step1, preprocessing the Chinese-Vietnamese parallel text corpus; the input is a Chinese document and a Vietnamese document, which are respectively split into word sequences $W_C=(C_1,C_2,\ldots,C_n)$ and $W_V=(V_1,V_2,\ldots,V_n)$ and sentence sequences $S_C=(S_{c1},S_{c2},\ldots,S_{cn})$ and $S_V=(S_{v1},S_{v2},\ldots,S_{vn})$; the word sequences, as input of the keyword acquisition layer, are processed to obtain word-level similarity information, and the sentence sequences, as input of the text compression layer, are processed to obtain the contextual semantic similarity features of the texts;
Step2, taking the word sequence as the input of the keyword acquisition layer, processing Vietnamese and Chinese differently to obtain the co-occurring keyword information between texts, and calculating the keyword-based text similarity information;
Step2.1, extract the keywords of each document with the keyword extraction algorithm TextRank and obtain each keyword's weight. The algorithm represents the relations among words as a directed weighted graph $G=(V,E)$, where $V$ is the vertex set and $E$ is the edge set; the weight is computed as:

$$WS(V_i) = (1-d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where $d$ is the damping coefficient, $WS(V_i)$ and $WS(V_j)$ are the weights of words $V_i$ and $V_j$, $In(V_i)$ is the set of vertices pointing to $V_i$, $Out(V_j)$ is the set of vertices that $V_j$ points to, and $w_{ji}$ and $w_{jk}$ are the edge weights between $V_j, V_i$ and $V_j, V_k$, respectively;
Step2.2, because Chinese and Vietnamese texts differ somewhat in their choice of terms, near-synonyms of the keywords are computed to reduce the terminology gap between the cross-language documents. After the keyword information of each text is output by the TextRank algorithm, the Vietnamese keywords are translated into Chinese by a translation module; for both the translated keywords and the Chinese keywords, a set of near-synonyms is computed with the Chinese synonym tool Synonyms, and the near-synonyms are merged with each article's keywords to form the Chinese text keyword set and the Vietnamese text keyword set;
Step2.3, to obtain the keyword similarity features of the two documents, the co-occurring keywords of the two articles are found from the obtained Chinese text keyword set and Vietnamese text keyword set, and the keyword similarity of the two articles is obtained as the share of the co-occurring keywords' weights in the weights of all extracted keywords. Specifically, the proportion of the weights of keywords co-occurring in both articles among the weights of all extracted keywords serves as the keyword-based text similarity, computed as:

$$Sim_1 = \frac{\sum_{i=1}^{m} WI_i}{\sum_{i=1}^{n} WC_i}$$

where $WI_i$ is the weight of the $i$-th co-occurring keyword, $WC_i$ is the weight of the $i$-th extracted keyword, $n$ is the number of extracted keywords, and $m$ is the number of co-occurring keywords.
Step3, taking the sentence sequence as the input of the text compression layer, removing sentences unrelated to the co-occurrence keywords to compress the text, splicing the sentences that contain co-occurrence keywords, inputting the resulting Chinese and Vietnamese short texts into the Chinese-Vietnamese BERT model to capture the context semantic features of the text, and calculating the sentence-based semantic feature similarity;
To map Chinese and Vietnamese sentences or short text paragraphs into a shared dense vector space, a Chinese-Vietnamese BERT model (ZH-VI BERT) capable of capturing Chinese-Vietnamese contextual semantics is trained, extending an existing sentence-embedding model to a new language by knowledge distillation. A teacher model $M$ maps the source language $s$ to a dense vector space; the training data are Chinese-Vietnamese parallel sentence pairs $((s_1,t_1),\ldots,(s_n,t_n))$, where $s_i$ is a source-language sentence and $t_i$ is its target-language translation. A new student model $\hat{M}$ is trained such that $\hat{M}(s_i) \approx M(s_i)$ and $\hat{M}(t_i) \approx M(s_i)$. Because the student $\hat{M}$ distills the knowledge of the teacher $M$, this method is called multilingual knowledge distillation. Given a mini-batch $B$, the mean squared error (MSE) is minimized:

$$\min_{\hat{M}} \frac{1}{|B|} \sum_{j \in B} \left[ \left( M(s_j) - \hat{M}(s_j) \right)^2 + \left( M(s_j) - \hat{M}(t_j) \right)^2 \right]$$

The student model $\hat{M}$ may adopt the structure and weights of the teacher model $M$, or an entirely different network architecture with different weights; the training process is shown in FIG. 2. A Chinese BERT model serves as the teacher and a multilingual BERT model as the student;
and the compressed text is input into the trained Chinese-Vietnamese BERT model for semantic capture to obtain the context semantic features.
The specific method for capturing the context semantic features of the text and calculating the similarity of the semantic features based on sentences is as follows:
The Chinese short text and the Vietnamese short text associated with each co-occurrence keyword are input into the Chinese-Vietnamese BERT model and encoded; the cosine similarity of the two output feature vectors is then computed as:

$$F(S_1, S_2) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}$$

where $a_i$ is the $i$-th component of the vector of the Chinese short text $S_1$, and $b_i$ is the $i$-th component of the vector of the Vietnamese short text $S_2$. After the co-occurrence-keyword-based similarity of each Chinese-Vietnamese short-text pair is obtained, the values are averaged to give the context-semantic similarity $Sim_2$:

$$Sim_2 = \frac{1}{m} \sum_{i=1}^{m} F_i$$

where $F_i$ is the context semantic similarity for the $i$-th co-occurrence keyword.
And Step4, fusing the similar information based on the keywords and the semantic features based on the sentences to obtain the similar information of the final text.
The method is specifically as follows: the average of the two computed similarity values is taken as the similarity of the two articles; the result lies between 0 and 1, where 0 means completely different and 1 means identical:

$$Sim = \frac{Sim_1 + Sim_2}{2}$$
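The Step4 fusion is then a single average (a trivial sketch of the formula above; the function name is illustrative):

```python
def final_similarity(sim1_value, sim2_value):
    """Sim = (Sim_1 + Sim_2) / 2: fuse the keyword-based and the
    context-semantic similarity into one score in [0, 1]."""
    return (sim1_value + sim2_value) / 2
```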
In fig. 1, the model of the present invention comprises the following:
A data preprocessing layer: first, the Chinese-Vietnamese text data is preprocessed according to the characteristics of the text and the nature of the neural network, so that the data meets the requirements of the model.
A keyword acquisition layer: obtains word-level similarity information between texts. Because a Chinese-Vietnamese text often contains a large amount of redundant information and the key information does not run through the whole article, it is difficult to capture key context information with a neural network. The Chinese-Vietnamese text similarity task is therefore converted into a similarity task over keywords and key sentences: keywords are extracted from the text, and sentences expressing the core semantics of the article are used to realize the similarity calculation.
A statistical characteristic acquisition layer: in order to obtain the keyword similarity characteristics of two documents, the Chinese text keyword word set and the Vietnamese text keyword word set obtained by a keyword obtaining layer are used for obtaining co-occurrence keywords of the two articles, and the weight of the co-occurrence keywords accounts for the weight of all extracted keywords to obtain the similarity of the keywords of the two articles.
Text compression layer: since a text contains much redundant information, a text compression method based on co-occurrence keywords is proposed to extract the key information of an article. The text is compressed using the co-occurrence keywords extracted by the statistical feature acquisition layer: sentences associated with them are retained and irrelevant sentences removed. If a keyword appears in several sentences, the Chinese sentences and the Vietnamese sentences for that keyword are spliced separately into two short texts, which serve as input to the context feature acquisition layer.
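A sketch of this co-occurrence-keyword text compression, assuming sentences are plain strings and a keyword "appears" in a sentence via substring match after word segmentation (all names are illustrative):

```python
def compress(sentences, co_keywords):
    """Keep only sentences containing a co-occurrence keyword; splice
    the hits for each keyword into one short text, dropping the rest."""
    shorts = {}
    for kw in co_keywords:
        hits = [s for s in sentences if kw in s]
        if hits:
            shorts[kw] = " ".join(hits)  # spliced short text for kw
    return shorts
```

Applied separately to the Chinese and the Vietnamese sentence sequences, this yields, per keyword, the two short texts fed to the context feature acquisition layer.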
Context feature acquisition layer: to capture contextual features, the currently most effective feature encoder, BERT, is used to encode the Chinese-Vietnamese sentences and obtain contextual semantic features.
Prediction layer: from the two kinds of similarity information produced by the statistical feature layer and the context feature acquisition layer, similarity information in two different dimensions, keywords and sentences, is obtained and fused.
To train the Chinese-Vietnamese BERT model for semantic feature extraction of Chinese and Vietnamese, 500,000 Chinese-Vietnamese parallel sentence pairs were constructed. To verify the validity of the proposed method, the algorithms proposed and used herein were tested: text-level aligned Chinese-Vietnamese news data and story datasets were collected from the web, and 400 and 800 document pairs respectively form the standard set used to verify the validity of the algorithm.
The effectiveness of the algorithm is measured by its text-matching accuracy, i.e., the proportion of correctly matched texts among all texts.
When training the Chinese-Vietnamese BERT model, the activation function is GELU, the hidden layer dimension is 768, the number of attention heads is 12, the number of hidden layers is 12, the hidden-layer dropout probability is 0.1, the learning rate is 2e-5 with the Adam optimizer, the batch size is 32, the number of epochs is 20, and the dictionary is that of the multilingual BERT model, of size 119547.
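The hyperparameters above can be collected in one place (a sketch; the key names follow Hugging Face `BertConfig` conventions and are assumptions, not taken from the patent):

```python
# ZH-VI BERT training settings as stated in the text; key names assumed.
zh_vi_bert_config = {
    "hidden_act": "gelu",
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "hidden_dropout_prob": 0.1,
    "learning_rate": 2e-5,
    "optimizer": "Adam",
    "batch_size": 32,
    "num_epochs": 20,
    "vocab_size": 119547,  # multilingual BERT dictionary size
}
```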
To verify the effectiveness of the method proposed herein, comparative experiments with some existing cross-language similarity calculation methods were set up to compare the method with results obtained from LDA topic models, full-text translation and BiRNN. The similarity calculation method for full-text translation translates Vietnamese into Chinese, represents sentences of two texts through BERT, and calculates the distance between the sentences to obtain the similarity of the texts.
The accuracy of the proposed model is compared with some existing models in Table 1. Clearly, the proposed method outperforms the topic model, showing that text compression and BERT-based context semantic encoding mine text similarity information better; it also improves on the full-text translation method, because a translation system introduces errors when translating Chinese-Vietnamese text, and translating only the extracted keywords effectively reduces these errors compared with full-text translation; compared with BiRNN, the accuracy improves by 3.5 percent. The proposed method therefore improves on several existing similarity calculation models.
The Chinese-Vietnamese text similarity calculation fusing keywords and semantic features performs well on the Chinese-Vietnamese text similarity task, mainly for the following reasons: 1. the currently most mainstream feature extractor, BERT, is used, improving sentence-level semantic capture; 2. a similarity method fusing co-occurring keyword information and context semantic features between texts is proposed, mining similarity information from different dimensions; 3. the method alleviates text information redundancy, so the neural network captures text information more effectively.
TABLE 1. Accuracy comparison of low-resource Chinese-Vietnamese similarity methods (the table is rendered as an image in the original patent).
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (3)

1. A Chinese-Vietnamese text similarity calculation method fusing keywords and semantic features, characterized by comprising the following steps:
step1, preprocessing the corpus data of the Chinese-Yue text, and splitting the text into a word sequence and a sentence sequence;
step2, taking the sequence of words as the input of a keyword acquisition layer, carrying out different processing on Vietnamese and Chinese to obtain the information of co-occurring keywords between texts, and calculating the text similarity information based on the keywords;
Step3, taking the sentence sequence as the input of a text compression layer, removing sentences unrelated to the co-occurrence keywords to compress the text, splicing the sentences that contain co-occurrence keywords, inputting the resulting Chinese and Vietnamese short texts into the Chinese-Vietnamese BERT model to capture the context semantic features of the text, and calculating the sentence-based semantic feature similarity;
step4, fusing the similar information based on the keywords and the semantic features based on the sentences to obtain the similar information of the final text;
the specific steps of Step2 are as follows:
Step2.1, extract the keywords of each document with the keyword extraction algorithm TextRank and obtain each keyword's weight. The algorithm represents the relations among words as a directed weighted graph $G=(V,E)$, where $V$ is the vertex set and $E$ is the edge set; the weight is computed as:

$$WS(V_i) = (1-d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where $d$ is the damping coefficient, $WS(V_i)$ and $WS(V_j)$ are the weights of words $V_i$ and $V_j$, $In(V_i)$ is the set of vertices pointing to $V_i$, $Out(V_j)$ is the set of vertices that $V_j$ points to, and $w_{ji}$ and $w_{jk}$ are the edge weights between $V_j, V_i$ and $V_j, V_k$, respectively;
Step2.2, after the keyword information of each text is output by the TextRank algorithm, the Vietnamese text keywords are translated into Chinese by a translation module; for both the translated keywords and the Chinese keywords, a set of near-synonyms is computed with the Chinese synonym tool Synonyms, and the near-synonyms are merged with each article's keywords to form the Chinese text keyword set and the Vietnamese text keyword set;
Step2.3, to obtain the keyword similarity features of the two documents, the co-occurring keywords of the two articles are found from the obtained Chinese text keyword set and Vietnamese text keyword set, and the keyword similarity of the two articles is obtained as the share of the co-occurring keywords' weights in the weights of all extracted keywords;
in Step2.3, the specific method for determining the keyword similarity of the two articles from the share of the co-occurring keywords' weights among all extracted keywords is as follows: the proportion of the weights of keywords co-occurring in both articles among the weights of all extracted keywords serves as the keyword-based text similarity, computed as:

$$Sim_1 = \frac{\sum_{i=1}^{m} WI_i}{\sum_{i=1}^{n} WC_i}$$

where $WI_i$ is the weight of the $i$-th co-occurring keyword, $WC_i$ is the weight of the $i$-th extracted keyword, $n$ is the number of extracted keywords, and $m$ is the number of co-occurring keywords;
in Step 3:
to map Chinese and Vietnamese sentences or short text paragraphs into a shared dense vector space, a Chinese-Vietnamese BERT model capable of capturing Chinese-Vietnamese contextual semantics is trained, extending an existing sentence-embedding model to a new language by knowledge distillation; a teacher model $M$ maps the source language $s$ to a dense vector space, the training data are Chinese-Vietnamese parallel sentence pairs $((s_1,t_1),\ldots,(s_n,t_n))$, where $s_i$ is a source-language sentence and $t_i$ is its target-language translation, and a new student model $\hat{M}$ is trained such that $\hat{M}(s_i) \approx M(s_i)$ and $\hat{M}(t_i) \approx M(s_i)$; because the student $\hat{M}$ distills the knowledge of the teacher $M$, this method is called multilingual knowledge distillation; given a mini-batch $B$, the mean squared error (MSE) is minimized:

$$\min_{\hat{M}} \frac{1}{|B|} \sum_{j \in B} \left[ \left( M(s_j) - \hat{M}(s_j) \right)^2 + \left( M(s_j) - \hat{M}(t_j) \right)^2 \right]$$

the student model $\hat{M}$ may adopt the structure and weights of the teacher model $M$, or an entirely different network architecture with different weights; a Chinese BERT model serves as the teacher and a multilingual BERT model as the student;
the trained model is called a Chinese-more BERT model, and the compressed text is input into the trained Chinese-more BERT model and subjected to semantic capture to obtain context semantic features;
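The distillation objective above can be sketched numerically as follows. This is an illustration of the loss only, assuming the teacher and student are plain embedding functions; in the described method they would be a Chinese BERT (teacher) and a multilingual BERT (student), trained by gradient descent rather than evaluated as stubs.

```python
# Sketch of the multilingual knowledge-distillation loss: the student must
# match the teacher's embedding of the Chinese source sentence for BOTH the
# source sentence and its Vietnamese translation.
import numpy as np

def distillation_mse(teacher, student, batch):
    """batch: list of (source_sentence, target_sentence) pairs."""
    losses = []
    for s, t in batch:
        ts = teacher(s)                                  # teacher embedding of source
        losses.append(np.mean((ts - student(s)) ** 2))   # student on source
        losses.append(np.mean((ts - student(t)) ** 2))   # student on target
    return float(np.mean(losses))

# Stub models for illustration: teacher embeds everything as ones, student as zeros.
loss = distillation_mse(lambda s: np.ones(4), lambda s: np.zeros(4),
                        [("你好", "xin chào")])  # 1.0
```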
in Step3, the context semantic features of the text are captured, and the sentence-based semantic feature similarity is calculated as follows:
the Chinese short texts and the Vietnamese short texts related to the co-occurrence keywords are respectively input into the Chinese-Vietnamese BERT model to be encoded, and the cosine distance between the two encoded output feature vectors is calculated using cosine similarity, the calculation formula being as follows:

$$\cos(S_{1},S_{2})=\frac{\sum_{i=1}^{n} a_{i} b_{i}}{\sqrt{\sum_{i=1}^{n} a_{i}^{2}}\sqrt{\sum_{i=1}^{n} b_{i}^{2}}}$$

wherein a_i represents the i-th feature value of the vector of the Chinese short text S1, and b_i represents the i-th feature value of the vector of the Vietnamese short text S2;
after the Chinese-Vietnamese short-text similarity information based on each co-occurrence keyword is obtained, the values are averaged to obtain the similarity Sim2 based on context semantic information, the calculation formula being as follows:

$$Sim_{2}=\frac{1}{m}\sum_{i=1}^{m} F_{i}$$

wherein F_i represents the context semantic similarity of the i-th co-occurrence keyword and m is the number of co-occurrence keywords.
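The two formulas above (per-keyword cosine similarity, then averaging into Sim2) can be sketched as follows. The vectors are illustrative stand-ins for Chinese-Vietnamese BERT encodings, not real model outputs.

```python
# Sketch of Step 3 scoring: cosine similarity between the Chinese and
# Vietnamese short-text vectors for each co-occurrence keyword, averaged
# into the context-semantic similarity Sim2.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def sim2(pairs):
    """pairs: list of (zh_vector, vi_vector), one per co-occurrence keyword."""
    scores = [cosine(a, b) for a, b in pairs]
    return sum(scores) / len(scores) if scores else 0.0

# Two hypothetical keyword-aligned text pairs: identical vs. orthogonal vectors.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
score = sim2(pairs)  # (1.0 + 0.0) / 2 = 0.5
```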
2. The method for calculating similarity of Chinese-Yue text fusing keywords and semantic features according to claim 1, characterized in that: in Step 1:
step1, word segmentation and stop-word removal are first performed on the Chinese-Vietnamese parallel text corpus data, and the word sequences and sentence sequences of the text are taken as the input of the downstream models.
3. The method for calculating similarity of Chinese-Yue text fusing keywords and semantic features according to claim 1, characterized in that the specific steps of Step1 are as follows:
step1.1, the Chinese-Vietnamese parallel text corpus data are preprocessed, the input being a Chinese document and a Vietnamese document; the Chinese document and the Vietnamese document are respectively split into word sequences WC=(C1,C2,…,Cn), WV=(V1,V2,…,Vn) and sentence sequences SC=(Sc1,Sc2,…,Scn), SV=(Sv1,Sv2,…,Svn); the word sequences are used as the input of the keyword acquisition layer and are processed to obtain word-based similarity information of the texts, and the sentence sequences are used as the input of the text compression layer and are processed to obtain the context semantic similarity features of the texts.
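The splitting in step1.1 can be sketched as follows. This is a simplified stand-in: word segmentation is stubbed with whitespace splitting and a tiny stop-word list, whereas the described method would use proper Chinese and Vietnamese segmenters; the sample document is hypothetical.

```python
# Sketch of step1.1 preprocessing: split a document into a sentence sequence
# and a stop-word-filtered word sequence for the downstream layers.
import re

STOP_WORDS = {"的", "了", "是"}   # illustrative stop-word list

def preprocess(document):
    sentences = [s for s in re.split(r"[。！？.!?]", document) if s.strip()]
    words = [w for s in sentences for w in s.split() if w not in STOP_WORDS]
    return words, sentences

words, sents = preprocess("地震 发生 了 在 城市。 救援 队 到达。")
# sents has 2 sentences; words excludes the stop word "了"
```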
CN202011006911.7A 2020-09-23 2020-09-23 Chinese-Yue text similarity calculation method fusing keywords and semantic features Active CN112257453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011006911.7A CN112257453B (en) 2020-09-23 2020-09-23 Chinese-Yue text similarity calculation method fusing keywords and semantic features

Publications (2)

Publication Number Publication Date
CN112257453A CN112257453A (en) 2021-01-22
CN112257453B true CN112257453B (en) 2022-02-22

Family

ID=74231459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011006911.7A Active CN112257453B (en) 2020-09-23 2020-09-23 Chinese-Yue text similarity calculation method fusing keywords and semantic features

Country Status (1)

Country Link
CN (1) CN112257453B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076398B (en) * 2021-03-30 2022-07-29 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance
CN113011194B (en) * 2021-04-15 2022-05-03 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113469977B (en) * 2021-07-06 2024-01-12 浙江霖研精密科技有限公司 Flaw detection device, method and storage medium based on distillation learning mechanism
CN113657125B (en) * 2021-07-14 2023-05-26 内蒙古工业大学 Mongolian non-autoregressive machine translation method based on knowledge graph
CN113901840B (en) * 2021-09-15 2024-04-19 昆明理工大学 Text generation evaluation method based on multi-granularity characteristics
CN114595688B (en) * 2022-01-06 2023-03-10 昆明理工大学 Chinese cross-language word embedding method fusing word cluster constraint
CN114528276B (en) * 2022-02-21 2024-01-19 新疆能源翱翔星云科技有限公司 Big data acquisition, storage and management system and method based on artificial intelligence
CN114707516A (en) * 2022-03-29 2022-07-05 北京理工大学 Long text semantic similarity calculation method based on contrast learning
CN115146629A (en) * 2022-05-10 2022-10-04 昆明理工大学 News text and comment correlation analysis method based on comparative learning
CN114912449B (en) * 2022-07-18 2022-09-30 山东大学 Technical feature keyword extraction method and system based on code description text
CN116680420B (en) * 2023-08-02 2023-10-13 昆明理工大学 Low-resource cross-language text retrieval method and device based on knowledge representation enhancement

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750687B (en) * 2013-12-25 2018-03-20 株式会社东芝 Improve method and device, machine translation method and the device of bilingualism corpora
CN108304390B (en) * 2017-12-15 2020-10-16 腾讯科技(深圳)有限公司 Translation model-based training method, training device, translation method and storage medium
CN109145289A (en) * 2018-07-19 2019-01-04 昆明理工大学 Based on the old-Chinese bilingual sentence similarity calculating method for improving relation vector model
CN109325229B (en) * 2018-09-19 2023-01-31 中译语通科技股份有限公司 Method for calculating text similarity by utilizing semantic information
CN110377918B (en) * 2019-07-15 2020-08-28 昆明理工大学 Chinese-transcendental neural machine translation method fused with syntactic parse tree
CN111581943A (en) * 2020-04-02 2020-08-25 昆明理工大学 Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph

Also Published As

Publication number Publication date
CN112257453A (en) 2021-01-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant