CN110413956B - Text similarity calculation method based on bootstrapping - Google Patents


Info

Publication number: CN110413956B
Application number: CN201810400574.6A
Authority: CN (China)
Prior art keywords: word, text, refers, words, weight
Other languages: Chinese (zh)
Other versions: CN110413956A
Inventors: 王清琛, 杜振东
Current assignee: Nanjing Yunwen Network Technology Co ltd
Original assignee: Nanjing Yunwen Network Technology Co ltd
Application filed by Nanjing Yunwen Network Technology Co ltd
Priority to CN201810400574.6A
Publication of CN110413956A
Application granted
Publication of CN110413956B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text similarity calculation method based on bootstrapping, comprising the following steps: calculating the inverse document frequency of each word as the initial value of its word weight; selecting an initial core vocabulary according to the inverse document frequency; calculating the co-occurrence matrix of the words in the text; calculating, according to the bootstrapping algorithm, the correlation between candidate words and the initial core words as the coefficient for updating the weights; and calculating sentence vectors from the word vectors V, the word weights W and the part-of-speech weights F. By adopting the technical scheme of the invention, the accuracy of short-text similarity calculation can be significantly improved.

Description

Text similarity calculation method based on bootstrapping
Technical Field
The invention relates to a word weight calculation method, in particular to a text similarity calculation method based on bootstrapping.
Background
In today's internet information age, a large amount of text information must be processed before it can be used effectively, so the field of natural language processing is continuously evolving. In natural language processing, a common approach is to segment the text into words, express each word by a weight, and generate a vector space model. Many effective word weight calculation methods have been proposed; tf-idf is one of the most commonly used.
The bootstrapping algorithm is a statistical process that repeatedly resamples a limited sample: each iteration extracts a new sample similar to the original one.
A word vector maps each segmented word in the corpus into a multidimensional vector space rich in contextual information. The dimension of the word vector can be set according to the specific task. Word vectors conveniently convert text information into computable numerical information and play an important role in natural language processing.
Disclosure of Invention
To remedy the defect that traditional idf judges the weight of a word only from word frequency and neglects the correlation between words, the invention provides a text similarity calculation method based on bootstrapping, which optimizes the idf word weights in order to improve text similarity calculation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a text similarity calculation method based on bootstrapping comprises the following steps:
step one, calculating the inverse document frequency of each word as the initial value of its word weight;
step two, selecting an initial core vocabulary according to the inverse document frequency;
step three, calculating the co-occurrence matrix of the words in the text;
step four, calculating, according to the bootstrapping algorithm, the correlation between candidate words and the initial core words as the coefficient for updating the weights;
step five, calculating sentence vectors from the word vectors V, the word weights W and the part-of-speech weights F.
Further, the word weight in step one is a numerical value representing the word in the text and is used to generate a real-valued vector for the text: the better a word represents the subject of the text, the higher its weight.
Further, the correlation in step four is calculated as follows:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

wherein S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th candidate word, F(S_i, R_j) refers to the co-occurrence frequency of the initial core word S_i and the candidate word R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j.
Further, the coefficient for updating the weight in step four is calculated as follows:

K(R_j) = max_i T(S_i, R_j) × ( max_i T(S_i, R_j) − (1/|S|) Σ_{i=1..|S|} T(S_i, R_j) )

wherein max_i T(S_i, R_j) refers to the maximum correlation and |S| refers to the number of words in the core vocabulary S.
Further, in step four the coefficient is updated once per iteration, and the updated weight is calculated as follows:
wherein j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of the weight of word C_i in the j-th iteration.
Further, the sentence vector in step five is calculated as follows:
wherein D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is searched iteratively over [0, 1] for its optimal value.
The beneficial effects are that:
1. The invention uses the correlation in the bootstrapping algorithm to update the word weights.
2. The invention calculates the correlation between a candidate word w and each of the initial core words s, and uses maximum value × (maximum value − average value) as the basis for deciding whether the candidate word is added to the new initial core vocabulary.
3. Sentence vectors are calculated from the word vectors V, the word weights W and the part-of-speech weights F.
4. By adopting the technical scheme of the invention, the accuracy of short-text similarity calculation can be significantly improved.
Drawings
FIG. 1 is a flow chart of text similarity calculation according to an embodiment of the present invention;
FIG. 2 is a flow chart of updating word segmentation weights according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated below with reference to FIGS. 1 and 2 and to the examples.
In this embodiment, the user questions from the customer service question-answering system of an e-commerce platform and the standard questions in its knowledge base are used as the corpus D.
As shown in FIGS. 1 and 2, after word segmentation, stop-word removal and part-of-speech tagging are performed on the corpus D, the word2vec algorithm is used to generate and store the word vectors V.
In this embodiment, each user question or standard question in the corpus D is treated as a document, and the inverse document frequency of every word is calculated. For example, the inverse document frequency idf(C_i) of word C_i is calculated as:

idf(C_i) = log( |D| / |{d ∈ D : C_i ∈ d}| )

where |D| refers to the total number of documents in the corpus D and |{d ∈ D : C_i ∈ d}| refers to the number of documents containing the word C_i. The top 20 ranked words are selected as the initial core vocabulary S, and the initial core words are removed from the total vocabulary C to generate the candidate vocabulary R. The initial core vocabulary S is shown in Table 1 below.
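The idf computation above can be sketched in Python; the toy corpus below is an illustrative assumption (the embodiment uses segmented user questions, which are not reproduced here):

```python
import math

def inverse_document_frequency(corpus):
    """idf(C_i) = log(|D| / number of documents containing C_i)."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word in set(doc):          # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

# Toy corpus: each "document" is one segmented question.
corpus = [["how", "is", "shipping"],
          ["when", "is", "shipping"],
          ["how", "to", "vacation"]]
idf = inverse_document_frequency(corpus)
# "vacation" appears in 1 of 3 documents, so idf = log(3)
```

Ranking all words by idf and taking the top 20 would give the initial core vocabulary S; the remaining words form the candidate vocabulary R.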
TABLE 1 initial core vocabulary S
Taking the inverse document frequency of every word in the corpus D as its initial weight w, the correlation T is calculated with the bootstrapping algorithm. For example, the correlation between the initial core word S_i and the candidate word R_j is calculated as:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

wherein S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th candidate word, F(S_i, R_j) refers to the co-occurrence frequency of the initial core word S_i and the candidate word R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j. Taking "shipping" and "vacation" as candidate words and "me" as the initial core word, for example, T(me, shipping) = log2(13098) × 13098/284020 ≈ 0.63 and T(me, vacation) = log2(28) × 28/896 ≈ 0.15. New initial core words are extracted according to a correlation threshold p, added to the initial core vocabulary S and removed from the candidate vocabulary R, and the next iteration is performed until no new initial core word appears. The setting of the correlation threshold p depends on the calculation of the correlation T; in order to control the number of new initial core words generated in each iteration, the value of p is kept within [0.8, 1].
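The correlation step can be sketched in Python. The formula is the one reconstructed from the worked example (log2 of the co-occurrence frequency, times the ratio of co-occurrence frequency to the candidate word's document frequency), using the counts quoted in the text:

```python
import math

def correlation(cooc_freq, cand_doc_freq):
    """T(S_i, R_j) = log2(F(S_i, R_j)) * F(S_i, R_j) / F(R_j)."""
    if cooc_freq == 0:
        return 0.0
    return math.log2(cooc_freq) * cooc_freq / cand_doc_freq

# Counts from the embodiment: core word "me" vs. two candidates.
t_shipping = correlation(13098, 284020)   # ≈ 0.63
t_vacation = correlation(28, 896)         # ≈ 0.15
```

A candidate whose correlation exceeds the threshold p ∈ [0.8, 1] would be promoted into the core vocabulary for the next iteration.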
In the iterative process, the update coefficient K(R_j) of the weight of candidate word R_j is obtained from its correlation T with each initial core word S_i, calculated as:

K(R_j) = max_i T(S_i, R_j) × ( max_i T(S_i, R_j) − (1/|S|) Σ_{i=1..|S|} T(S_i, R_j) )

wherein max_i T(S_i, R_j) refers to the maximum correlation and |S| refers to the number of words in the initial core vocabulary S. Taking "shipping" and "vacation" as examples, K(shipping) = 1.878 × (1.878 − 1.218) ≈ 1.24 and K(vacation) = 1.167 × (1.167 − 0.992) ≈ 0.2. The update coefficient K of a candidate word is kept within the range [1, 2], and the update coefficient of an initial core word defaults to 2.
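The update coefficient can be sketched as below. The correlation lists are illustrative: they are chosen so that their maximum and mean match the "shipping" (max 1.878, mean 1.218) and "vacation" (max 1.167, mean 0.992) examples in the text:

```python
def update_coefficient(correlations):
    """K(R_j) = maxT * (maxT - average T over the initial core vocabulary).
    The embodiment then controls candidate coefficients to [1, 2]."""
    max_t = max(correlations)
    avg_t = sum(correlations) / len(correlations)
    return max_t * (max_t - avg_t)

# Illustrative correlation lists whose max/mean match the text's examples.
k_shipping = update_coefficient([1.878, 0.558])   # 1.878 * (1.878 - 1.218) ≈ 1.24
k_vacation = update_coefficient([1.167, 0.817])   # 1.167 * (1.167 - 0.992) ≈ 0.20
```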
The word weights are then updated using idf and K; for example, the weight of word C_i is updated as follows:
wherein j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of the weight of word C_i in the j-th iteration. Taking "shipping" and "vacation" as examples, W(shipping) ≈ 0.339 and W(vacation) ≈ 1.235.
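The exact weight-update formula sits in an unrecovered figure; one plausible reading, offered purely as an assumption, is that the idf initial weight is multiplied by the update coefficient produced in each iteration:

```python
def updated_weight(idf_value, coefficients):
    """Hypothetical reading of the update: W(C_i) = idf(C_i) multiplied by
    K_j(C_i) for each iteration j = 1..n. Not confirmed by the source text."""
    w = idf_value
    for k in coefficients:
        w *= k
    return w

# Purely illustrative numbers: an idf of 0.5 scaled over three iterations.
w = updated_weight(0.5, [1.24, 1.1, 1.05])
```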
Part-of-speech weights are assigned in increments of 0.1 within the range [0.2, 2] according to the part of speech, yielding the part-of-speech weight vocabulary F, as shown in Table 2 below.
TABLE 2 part-of-speech weight vocabulary
The vector M of each text is calculated, and similarity is compared using cosine similarity. For example, the sentence vector of text D_i is calculated as follows:
wherein D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is searched iteratively over [0, 1] for its optimal value. The test results are shown in Table 3 below:
table 3 comparison of test results
The similarity between text vectors is calculated using cosine similarity, and the standard question with the highest similarity is taken as the correct target result.
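The final matching step can be sketched as below. The sentence-vector construction here (word vectors scaled by W × F, then averaged) is an assumption, since the patent's exact combination, including the α parameter, is in an unrecovered formula; the cosine-similarity ranking itself follows the text:

```python
import math

def sentence_vector(words, vec, w, f):
    """Average of word vectors, each scaled by its word weight W and
    part-of-speech weight F (assumed combination)."""
    dim = len(next(iter(vec.values())))
    total = [0.0] * dim
    for word in words:
        for d in range(dim):
            total[d] += w[word] * f[word] * vec[word][d]
    return [x / len(words) for x in total]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy data: pick the standard question whose vector is closest to the query.
vec = {"shipping": [1.0, 0.0], "vacation": [0.0, 1.0]}
w = {"shipping": 1.24, "vacation": 0.2}
f = {"shipping": 1.0, "vacation": 1.0}
query = sentence_vector(["shipping"], vec, w, f)
candidates = {"q1": ["shipping"], "q2": ["vacation"]}
best = max(candidates,
           key=lambda q: cosine(query, sentence_vector(candidates[q], vec, w, f)))
```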
It should be apparent to those skilled in the art that various modifications or variations can be made to the present invention, based on its technical solutions, without any inventive effort.

Claims (3)

1. A text similarity calculation method based on bootstrapping, characterized by comprising the following steps:
step one, calculating the inverse document frequency of each word as the initial value of its word weight;
step two, selecting an initial core vocabulary according to the inverse document frequency;
step three, calculating the co-occurrence matrix of the words in the text;
step four, calculating, according to the bootstrapping algorithm, the correlation between candidate words and the initial core words as the coefficient for updating the weights;
step five, calculating sentence vectors from the word vectors V, the word weights W and the part-of-speech weights F;
the correlation in step four is calculated as follows:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

wherein T(S_i, R_j) refers to the correlation between the initial core word S_i and the candidate word R_j, S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th candidate word, F(S_i, R_j) refers to the co-occurrence frequency of the initial core word S_i and the candidate word R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j;
the coefficient for updating the weight in step four is calculated as follows:

K(R_j) = max_i T(S_i, R_j) × ( max_i T(S_i, R_j) − (1/|S|) Σ_{i=1..|S|} T(S_i, R_j) )

wherein K(R_j) refers to the update coefficient of the weight of candidate word R_j, max_i T(S_i, R_j) refers to the maximum correlation, and |S| refers to the number of words in the core vocabulary S;
and in step four, the update coefficient is updated once per iteration, and the word weight is updated according to the following formula:
wherein W(C_i) refers to the weight of word C_i, j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of the weight of word C_i in the j-th iteration.
2. The text similarity calculation method based on bootstrapping according to claim 1, wherein the word weight in step one is a numerical value representing the word in the text, and a real-valued vector of the text is generated.
3. The text similarity calculation method based on bootstrapping according to claim 1, wherein the sentence vector in step five is calculated as follows:
wherein D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is searched iteratively over [0, 1] for its optimal value.
CN201810400574.6A 2018-04-28 2018-04-28 Text similarity calculation method based on bootstrapping Active CN110413956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810400574.6A CN110413956B (en) 2018-04-28 2018-04-28 Text similarity calculation method based on bootstrapping


Publications (2)

Publication Number Publication Date
CN110413956A CN110413956A (en) 2019-11-05
CN110413956B true CN110413956B (en) 2023-08-01

Family

ID=68357008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810400574.6A Active CN110413956B (en) 2018-04-28 2018-04-28 Text similarity calculation method based on bootstrapping

Country Status (1)

Country Link
CN (1) CN110413956B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159404B (en) * 2019-12-27 2023-09-19 海尔优家智能科技(北京)有限公司 Text classification method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010198141A (en) * 2009-02-23 2010-09-09 Rakuten Inc Device, method and program for preparing database in which phrase included in document classified by category
CN106156196A (en) * 2015-04-22 2016-11-23 富士通株式会社 Extract the apparatus and method of text feature
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN107220237A (en) * 2017-05-24 2017-09-29 南京大学 A kind of method of business entity's Relation extraction based on convolutional neural networks
CN107862620A (en) * 2017-12-11 2018-03-30 四川新网银行股份有限公司 A kind of similar users method for digging based on social data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Ontology Learning and Its Application in Semantic Retrieval; Liu Ting; China Masters' Theses Full-text Database, Information Science and Technology; 2012-04-15; pp. 1-55 *

Also Published As

Publication number Publication date
CN110413956A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN106997376B (en) Question and answer sentence similarity calculation method based on multi-level features
CN109408526B (en) SQL sentence generation method, device, computer equipment and storage medium
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN109670191B (en) Calibration optimization method and device for machine translation and electronic equipment
CN111427995B (en) Semantic matching method, device and storage medium based on internal countermeasure mechanism
CN111159359B (en) Document retrieval method, device and computer readable storage medium
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
US20100153315A1 (en) Boosting algorithm for ranking model adaptation
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
US11461613B2 (en) Method and apparatus for multi-document question answering
CN108475262A (en) Electronic equipment and method for text-processing
CN108021555A (en) A kind of Question sentence parsing measure based on depth convolutional neural networks
CN110929498B (en) Method and device for calculating similarity of short text and readable storage medium
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN109766547B (en) Sentence similarity calculation method
CN112434134B (en) Search model training method, device, terminal equipment and storage medium
CN113220864B (en) Intelligent question-answering data processing system
CN111966810A (en) Question-answer pair ordering method for question-answer system
CN110347833B (en) Classification method for multi-round conversations
WO2020100738A1 (en) Processing device, processing method, and processing program
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN110413956B (en) Text similarity calculation method based on bootstrapping
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN110442681A (en) A kind of machine reads method, electronic equipment and the readable storage medium storing program for executing of understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant