CN110413956B - Text similarity calculation method based on bootstrapping - Google Patents
Abstract
The invention discloses a text similarity calculation method based on bootstrapping, which comprises the following steps: calculating the inverse document frequency of each word as the initial value of its word weight; selecting an initial core vocabulary according to the inverse document frequency; calculating the co-occurrence matrix of the words in the text; calculating, according to a bootstrapping algorithm, the correlation between the candidate words and the initial core words as the coefficient for updating the weight; and calculating sentence vectors from the word vectors V, the word weights W, and the part-of-speech weights F. By adopting the technical scheme of the invention, similarity calculation for short texts can be significantly improved.
Description
Technical Field
The invention relates to a word weight calculation method, in particular to a text similarity calculation method based on bootstrapping.
Background
In today's internet information age, massive amounts of text must be processed before they can be used effectively, which drives the continuous development of natural language processing. In natural language processing, a common approach is to segment text into words, express each word by a weight, and generate a vector space model. Many effective word weight calculation methods have been proposed; tf-idf is one of the most commonly used.
The bootstrapping algorithm is a statistical process of repeated sampling from a limited sample; each iteration extracts a new sample similar to the original sample.
Word vectors map each segmented word in the corpus into a multidimensional vector space that captures contextual information. The dimension of the word vectors can be set according to the specific task. Word vectors conveniently convert text into computable numerical information and play an important role in natural language processing.
Disclosure of Invention
In order to overcome the defect that traditional idf judges the weight of a segmented word from word frequency alone and neglects the correlation between words, the invention provides a text similarity calculation method based on bootstrapping, which optimizes the idf word weights to improve text similarity calculation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a text similarity calculation method based on bootstrapping comprises the following steps:
step one, calculating the inverse document frequency of each word as the initial value of its word weight;
step two, selecting an initial core vocabulary according to the inverse document frequency;
step three, calculating the co-occurrence matrix of the words in the text;
step four, calculating, according to a bootstrapping algorithm, the correlation between the candidate words and the initial core words as the coefficient for updating the weight;
step five, calculating sentence vectors from the word vectors V, the word weights W, and the part-of-speech weights F.
Further, the word weight in step one is a numerical value representing the importance of a word in the text, and is used to generate a real-valued vector of the text. The more a word represents the subject of the text, the higher its weight.
Further, the calculation formula of the correlation in step four is as follows:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

wherein S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th word in the candidate vocabulary R, F(S_i, R_j) refers to the co-occurrence frequency of the initial core word S_i and the candidate word R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j.
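The correlation above can be sketched in Python as follows. The formula is reconstructed from the worked example in the embodiment; the dictionaries `cooc` (co-occurrence counts) and `doc_freq` (document-frequency counts) are hypothetical inputs assumed to be precomputed from the corpus:

```python
import math

def correlation(cooc: dict, doc_freq: dict, s: str, r: str) -> float:
    """T(s, r) = log2(F(s, r)) * F(s, r) / F(r), where F(s, r) is the
    co-occurrence frequency of core word s and candidate word r, and
    F(r) is the frequency of documents containing r."""
    f_sr = cooc.get((s, r), 0)   # co-occurrence frequency F(s, r)
    f_r = doc_freq.get(r, 0)     # document frequency F(r)
    if f_sr <= 1 or f_r == 0:    # guard: log2 is undefined at 0, zero at 1
        return 0.0
    return math.log2(f_sr) * f_sr / f_r

# The embodiment's example: T(me, shipping) = log2(13098) * 13098/284020
print(round(correlation({("me", "shipping"): 13098},
                        {"shipping": 284020}, "me", "shipping"), 2))  # 0.63
```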
Further, the calculation formula of the coefficient of the update weight in step four is as follows:

K(R_j) = max_i T(S_i, R_j) × (max_i T(S_i, R_j) − (1/|S|) Σ_{i=1}^{|S|} T(S_i, R_j))

wherein max_i T(S_i, R_j) refers to the maximum correlation, and |S| refers to the number of words in the initial core vocabulary S.
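This "maximum × (maximum − average)" rule can be sketched as a small helper; the input `correlations` is a hypothetical list holding T(S_i, R_j) for one candidate word against every initial core word:

```python
def update_coefficient(correlations):
    """K = maxT * (maxT - avgT): the candidate's largest correlation with
    any initial core word, scaled by how far it exceeds the average
    correlation over the |S| core words."""
    max_t = max(correlations)
    avg_t = sum(correlations) / len(correlations)
    return max_t * (max_t - avg_t)
```

With a maximum of 1.878 and an average of 1.218, this reproduces the embodiment's K(shipping) ≈ 1.24.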
Further, in step four, the coefficient is updated once per iteration, and the updated word weight is calculated as follows:

W(C_i) = idf(C_i) × Π_{j=1}^{n} K_j(C_i)

wherein j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of word C_i in the j-th iteration.
Further, the calculation formula of the sentence vector in step five is as follows:

M(D_i) = (1/|D_i|) Σ_{j=1}^{|D_i|} (α × W(D_ij) + (1 − α) × F(D_ij)) × V(D_ij)

wherein D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is iteratively searched in [0, 1] for its optimal value.
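The sentence-vector step can be sketched as follows. The exact way α combines the word weight W and the part-of-speech weight F is an assumption (the patent's text only states that α is tuned in [0, 1]); here each word vector is scaled by a linear mix of the two weights and the results are averaged:

```python
def sentence_vector(words, V, W, F, alpha=0.5):
    """Average of word vectors V[w], each scaled by a mix of word weight
    W[w] and part-of-speech weight F[w] controlled by alpha. The linear
    combination is an assumed reading of the patent's figure."""
    dim = len(next(iter(V.values())))
    m = [0.0] * dim
    for w in words:
        coeff = alpha * W.get(w, 0.0) + (1.0 - alpha) * F.get(w, 0.0)
        for k in range(dim):
            m[k] += coeff * V[w][k]
    return [x / len(words) for x in m]
```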
The beneficial effects are that:
1. The invention uses the correlation in the bootstrapping algorithm to update the word weights.
2. The invention calculates the correlation between a candidate word w and each of the initial core words s, and uses the maximum correlation multiplied by (maximum minus average) as the basis for judging whether the candidate word is added to the new initial core vocabulary.
3. Sentence vectors are calculated from the word vectors V, the word weights W, and the part-of-speech weights F.
4. By adopting the technical scheme of the invention, similarity calculation for short texts can be significantly improved.
Drawings
FIG. 1 is a flow chart of text similarity calculation according to an embodiment of the present invention;
FIG. 2 is a flow chart of updating word segmentation weights according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated below with reference to FIGS. 1 and 2 and an embodiment.
In this embodiment, user questions from the customer service question-answering system of an e-commerce platform and the standard questions in its knowledge base are used as the corpus D.
As shown in FIGS. 1 and 2, after word segmentation, stop-word removal, and part-of-speech tagging are performed on the corpus D, the word2vec algorithm is used to generate and store the word vectors V.
In this embodiment, each user question or standard question in the corpus D is treated as a document, and the inverse document frequency of every word is calculated. For example, the inverse document frequency idf(C_i) of word C_i is calculated as follows:

idf(C_i) = log(|D| / |{d ∈ D : C_i ∈ d}|)

where |D| refers to the total number of documents in corpus D, and |{d ∈ D : C_i ∈ d}| refers to the number of documents containing word C_i. The top 20 ranked words are selected as the initial core vocabulary S, and the initial core words are removed from the total vocabulary C to generate the candidate vocabulary R. The initial core vocabulary S is shown in Table 1 below.
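The idf computation and the split into core vocabulary S and candidate vocabulary R can be sketched as follows; ranking in descending idf order is an assumption, since the patent only says "top 20 ranked words":

```python
import math
from collections import Counter

def build_core_vocab(docs, top_n=20):
    """Compute idf(c) = log(|D| / df(c)) over a tokenized corpus `docs`
    (a list of word lists), then split the vocabulary into an initial
    core list S (top_n words by idf) and a candidate list R."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    idf = {w: math.log(n_docs / c) for w, c in df.items()}
    ranked = sorted(idf, key=idf.get, reverse=True)
    return ranked[:top_n], ranked[top_n:], idf
```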
TABLE 1 initial core vocabulary S
Taking the inverse document frequency of all words in the corpus D as their respective initial weight values w, the correlation T is calculated using the bootstrapping algorithm. For example, the correlation between initial core word S_i and candidate word R_j is calculated as follows:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

where S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th word in the candidate vocabulary R, F(S_i, R_j) refers to the co-occurrence frequency of S_i and R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j. Taking "shipping" and "vacation" as candidate words and "me" as an initial core word: T(me, shipping) = log2(13098) × 13098/284020 ≈ 0.63, and T(me, vacation) = log2(28) × 28/896 ≈ 0.15. New initial core words are extracted according to the correlation threshold p, added to the initial core vocabulary S, and removed from the candidate vocabulary R, and the next iteration is performed until no new initial core words appear. The setting of the correlation threshold p depends on the calculated correlations T; to control the number of new initial core words generated per iteration, the value of p is kept in the range [0.8, 1].
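The iteration described above can be sketched as a promotion loop; `correlation` is repeated here (same reconstructed formula as before) so the block is self-contained, and `cooc`/`doc_freq` are hypothetical precomputed counts:

```python
import math

def correlation(cooc, doc_freq, s, r):
    # T(s, r) = log2(F(s, r)) * F(s, r) / F(r), as reconstructed above
    f_sr = cooc.get((s, r), 0)
    f_r = doc_freq.get(r, 0)
    return math.log2(f_sr) * f_sr / f_r if f_sr > 1 and f_r else 0.0

def bootstrap(core, candidates, cooc, doc_freq, p=0.8):
    """Promote every candidate whose maximum correlation with a current
    core word reaches threshold p; repeat until no new core word appears."""
    core, candidates = list(core), list(candidates)
    while True:
        promoted = [r for r in candidates
                    if max((correlation(cooc, doc_freq, s, r) for s in core),
                           default=0.0) >= p]
        if not promoted:
            return core
        core += promoted
        candidates = [r for r in candidates if r not in promoted]
```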
In the iterative process, the update coefficient K(R_j) of the weight of candidate word R_j is obtained from its correlations T with each initial core word S_i. The coefficient of the update weight is calculated as follows:

K(R_j) = max_i T(S_i, R_j) × (max_i T(S_i, R_j) − (1/|S|) Σ_{i=1}^{|S|} T(S_i, R_j))

where max_i T(S_i, R_j) refers to the maximum correlation, and |S| refers to the number of words in the initial core vocabulary S. Taking "shipping" and "vacation" as examples: K(shipping) = 1.878 × (1.878 − 1.218) ≈ 1.24, and K(vacation) = 1.167 × (1.167 − 0.992) ≈ 0.20. The update coefficient K of a candidate word is controlled within the range [1, 2], and the update coefficient of an initial core word defaults to 2.
The word weights are updated using idf and K. For example, the updated weight of word C_i is calculated as follows:

W(C_i) = idf(C_i) × Π_{j=1}^{n} K_j(C_i)

where j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of word C_i in the j-th iteration. Taking "shipping" and "vacation" as examples, W(shipping) ≈ 0.339 and W(vacation) ≈ 1.235.
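One plausible reading of the weight update, multiplying the initial idf weight by the coefficient K_j obtained in each of the n iterations, can be sketched as follows. The exact formula appears only in the patent's figure and is not recoverable from the text, so treat this as an assumption:

```python
def final_weight(idf_value, ks):
    """Apply each iteration's update coefficient K_j to the initial idf
    weight: W = idf * K_1 * K_2 * ... * K_n (assumed reading)."""
    w = idf_value
    for k in ks:
        w *= k   # apply the j-th iteration's coefficient
    return w
```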
Part-of-speech weights are assigned in increments of 0.1 within the range 0.2 to 2 according to the part of speech, giving the part-of-speech weight vocabulary F, as shown in Table 2 below.
TABLE 2 part-of-speech weight vocabulary
The vector M of each text is calculated, and similarity is compared using cosine similarity. For example, the sentence vector of text D_i is calculated as follows:

M(D_i) = (1/|D_i|) Σ_{j=1}^{|D_i|} (α × W(D_ij) + (1 − α) × F(D_ij)) × V(D_ij)

where D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is iteratively searched in [0, 1] for its optimal value. The test results are shown in Table 3 below:
table 3 comparison of test results
And calculating the similarity between the text vectors by using the cosine similarity, and taking the standard question with the highest similarity as a correct target result.
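The final retrieval step, picking the standard question whose sentence vector is most similar to the user question's vector, can be sketched as follows; `standard_vecs` is a hypothetical mapping from standard questions to their precomputed sentence vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_match(query_vec, standard_vecs):
    """Return the standard question with the highest cosine similarity
    to the user question's sentence vector."""
    return max(standard_vecs, key=lambda q: cosine(query_vec, standard_vecs[q]))
```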
It should be apparent to those skilled in the art that various modifications or variations can be made on the basis of the technical solutions of the present invention without inventive effort, and such modifications remain within the scope of the invention.
Claims (3)
1. A text similarity calculation method based on bootstrapping is characterized by comprising the following steps:
step one, calculating the inverse document frequency of each word as the initial value of its word weight;
step two, selecting an initial core vocabulary according to the inverse document frequency;
step three, calculating the co-occurrence matrix of the words in the text;
step four, calculating, according to a bootstrapping algorithm, the correlation between the candidate words and the initial core words as the coefficient for updating the weight;
step five, calculating sentence vectors from the word vectors V, the word weights W and the part-of-speech weights F;
the calculation formula of the correlation in step four is as follows:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

wherein T(S_i, R_j) refers to the correlation between the initial core word S_i and the candidate word R_j, S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th word in the candidate vocabulary R, F(S_i, R_j) refers to the co-occurrence frequency of S_i and R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j;
the calculation formula of the coefficient of the update weight in step four is as follows:

K(R_j) = max_i T(S_i, R_j) × (max_i T(S_i, R_j) − (1/|S|) Σ_{i=1}^{|S|} T(S_i, R_j))

wherein K(R_j) refers to the update coefficient of the weight of candidate word R_j, max_i T(S_i, R_j) refers to the maximum correlation, and |S| refers to the number of words in the core vocabulary S;
and in step four, the update coefficient is updated once per iteration, wherein the update calculation formula of the word weight is as follows:

W(C_i) = idf(C_i) × Π_{j=1}^{n} K_j(C_i)

wherein W(C_i) refers to the weight of word C_i, j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of word C_i in the j-th iteration.
2. The text similarity calculation method based on bootstrapping according to claim 1, wherein the word weight in step one is a numerical value representing a word in the text and is used to generate a real-valued vector of the text.
3. The text similarity calculation method based on bootstrapping according to claim 1, wherein the calculation formula of the sentence vector in step five is as follows:

M(D_i) = (1/|D_i|) Σ_{j=1}^{|D_i|} (α × W(D_ij) + (1 − α) × F(D_ij)) × V(D_ij)

wherein D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is iteratively searched in [0, 1] for its optimal value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810400574.6A CN110413956B (en) | 2018-04-28 | 2018-04-28 | Text similarity calculation method based on bootstrapping |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110413956A CN110413956A (en) | 2019-11-05 |
CN110413956B true CN110413956B (en) | 2023-08-01 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010198141A (en) * | 2009-02-23 | 2010-09-09 | Rakuten Inc | Device, method and program for preparing database in which phrase included in document classified by category |
CN106156196A (en) * | 2015-04-22 | 2016-11-23 | 富士通株式会社 | Extract the apparatus and method of text feature |
WO2017084267A1 (en) * | 2015-11-18 | 2017-05-26 | 乐视控股(北京)有限公司 | Method and device for keyphrase extraction |
CN107153658A (en) * | 2016-03-03 | 2017-09-12 | 常州普适信息科技有限公司 | A kind of public sentiment hot word based on weighted keyword algorithm finds method |
CN107220237A (en) * | 2017-05-24 | 2017-09-29 | 南京大学 | A kind of method of business entity's Relation extraction based on convolutional neural networks |
CN107862620A (en) * | 2017-12-11 | 2018-03-30 | 四川新网银行股份有限公司 | A kind of similar users method for digging based on social data |
Non-Patent Citations (1)
Research on Ontology Learning and Its Application in Semantic Retrieval; Liu Ting; China Master's Theses Full-text Database, Information Science and Technology; 2012-04-15; pp. 1-55
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant