CN108804595B - Short text representation method based on word2vec - Google Patents

Short text representation method based on word2vec

Info

Publication number
CN108804595B
Authority
CN
China
Prior art keywords
document
words
word
similar words
cosine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810525103.8A
Other languages
Chinese (zh)
Other versions
CN108804595A (en
Inventor
路永和
张炜婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201810525103.8A priority Critical patent/CN108804595B/en
Publication of CN108804595A publication Critical patent/CN108804595A/en
Application granted granted Critical
Publication of CN108804595B publication Critical patent/CN108804595B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a word2vec-based short text representation method comprising the following steps. S1: input a training text set that has undergone text preprocessing, set the word2vec parameters, and train to obtain the word vector set corresponding to the training text set. S2: for each word in each document, calculate the cosine distance between word vectors to obtain a series of similar words of that word across the whole training text set. S3: calculate the cosine distance between each document's similar words and the document. S4: sort by cosine distance in descending order and select the first n similar words, which together with their cosine distances form the n similar words and cosine measures of the document. S5: calculate the weights, in each document, of the document's own words and of the selected n similar words to form the new text representation, and output the improved word2vec-based vector space representation of each document.

Description

Short text representation method based on word2vec
Technical Field
The invention relates to the field of computer science and technology, and in particular to a word2vec-based short text representation method.
Background
In text mining, before a machine can read sample information, the samples must first pass through a text representation step that converts them into numerical values. With the continual broadening of natural language processing and the development of computer technology, how to use numerical values to better represent the semantic information carried by texts has always been one of the crucial research questions in text processing, because it directly influences the text mining effect. For short text mining in particular, an effective text feature representation method is a difficult research point: short texts generated on social platforms not only suffer from the traditional problems of feature sparseness, incomplete semantics, polysemy (one word, many meanings) and synonymy (many words, one meaning), but are also casually phrased, full of misused neologisms, and produced in large volume.
Commonly used text representation models are the Boolean model, the probabilistic model and the vector space model, the most commonly used being the Vector Space Model (VSM) proposed by Gerard Salton et al. in 1958. The basic idea of the vector space model is to represent a text with a vector: partial feature words are selected from the training set, and each feature word serves as one dimension of a vector space coordinate system, so that every text becomes a vector in a multi-dimensional vector space. Each text is then a point in an n-dimensional space, and the similarity between texts can be measured by the angle between their vectors or the distance between them (Tai D Y, Wang. Text classification feature weight improvement algorithm [J]. Computer Engineering, 2010, 36(9): 197-). However, the vector space model has the defect that its data space is sparse and it ignores the semantic information between words, which makes its representation of short texts somewhat weak. Researchers have tried to remedy these defects. For example, Wang B K et al. proposed a strong feature thesaurus (SFT) built from latent Dirichlet allocation (LDA) and information gain (IG), which combines LDA and IG to increase the weight of vocabulary and thereby select feature words with stronger semantic information (Wang B K, Huang Y F, Yang W X, et al. Short text classification based on strong feature thesaurus [J]. Journal of Zhejiang University-SCIENCE C (Computers & Electronics), 2012, 13(9): 649-). Yang Lili et al. proposed a semantic extension method that combines the words and semantic features of short texts: Wikipedia is used as a background knowledge base to obtain the semantic features of words, and feature word weights are recalculated on the basis of the combination of words and semantics (Yang L, Li C, Ding Q, et al. Combining Lexical and Semantic Features for Short Text Classification [J]. Procedia Computer Science, 2013, 22(0): 78-86.).
In 2013, Google's Tomas Mikolov team released word2vec, an open-source word vector generation tool based on deep learning (Mikolov T, Le Q V, Sutskever I. Exploiting similarities among languages for machine translation [J]. arXiv preprint arXiv:1309.4168, 2013. Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space [J]. arXiv preprint arXiv:1301.3781, 2013.). The algorithm can learn high-quality word vectors from a large-scale corpus of real documents in a short time, making it convenient to compute the semantic similarity between words. word2vec can not only uncover semantic information among words but also offers a new solution to the sparseness of the vector space model on short texts.
Disclosure of Invention
The invention aims to provide a word2vec-based short text representation method that addresses the data-space sparseness and missing semantics of the vector space model (VSM); clustering short texts represented with this method allows knowledge topics to be extracted better.
In order to achieve this purpose, the technical scheme is as follows:
a short text representation method based on word2vec comprises the following steps:
S1: input a training text set that has undergone text preprocessing, set the word2vec parameters, and train to obtain the word vector set corresponding to the training text set;
S2: for each word in each document, calculate the cosine distance between word vectors to obtain a series of similar words of that word across the whole training text set;
S3: calculate the cosine distance between each document's similar words and the document;
S4: sort by cosine distance in descending order and select the first n similar words, which together with their cosine distances form the n similar words and cosine measures of the document;
S5: calculate the weights, in each document, of the document's own words and of the selected n similar words to form the new text representation, and output the improved word2vec-based vector space representation of each document.
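Steps S2 to S4 all rest on the cosine distance between word vectors. A minimal sketch of that measure, using toy NumPy vectors in place of trained word2vec vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; this is the
    quantity the method calls the 'cosine distance' between words."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for 200-dimensional word2vec output.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
sim = cosine_similarity(a, b)  # 0.5 for these toy vectors
```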
Preferably, the preprocessing of the training text set in step S1 comprises:
S1.1: construct a user dictionary to perform word segmentation and part-of-speech tagging on the training text;
S1.2: remove stop words according to an existing stop word list, and remove pronouns, prepositions and locative words according to part of speech;
S1.3: perform feature selection with methods such as TF, IDF or TF-IDF, thereby reducing the feature dimension.
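The preprocessing pipeline of S1.1–S1.3 can be sketched as follows; the user dictionary, segmentation system and stop-word list are replaced here by toy English stand-ins, and the TF-IDF scoring is a plain textbook formulation rather than necessarily the exact variant used by the method:

```python
import math

# Toy pre-tokenized corpus; in the patent, NLPIR2016 performs the Chinese
# segmentation and a stop-word list plus POS filtering is applied first.
docs = [["policy", "child", "family"],
        ["policy", "income", "family"],
        ["child", "education", "cost"]]
stopwords = {"the", "of"}  # stand-in for a real stop-word list

docs = [[w for w in d if w not in stopwords] for d in docs]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF: term frequency times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# Keep the top-k terms by their best TF-IDF score across all documents,
# reducing the feature dimension as in S1.3.
scores = {}
for d in docs:
    for t in set(d):
        scores[t] = max(scores.get(t, 0.0), tf_idf(t, d, docs))
k = 4
features = sorted(scores, key=scores.get, reverse=True)[:k]
```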
Preferably, the specific calculation process of step S3 is as follows:
If several words in a document share a consistent similar word, the cosine distances of that shared similar word are added together to form the cosine distance between the similar word and the document; otherwise, the original similar words and their cosine distances to the document's words are kept:
s(t,d)=s(t,t1)+s(t,t2)+s(t,t3)+…+s(t,tn) (1)
where t1, t2, t3, …, tn are the words in document d, s(t, tn) denotes the cosine distance between word t and word tn of document d, and s(t, d) denotes the cosine measure of word t and document d.
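A sketch of equation (1): for each similar word t, the cosine distances contributed by the document's words are summed when they coincide, otherwise kept as-is. The dictionaries of similar words below are illustrative stand-ins for word2vec output:

```python
def cosine_measure(similar_lists):
    """similar_lists: one dict per word of the document, mapping each of
    that word's similar words to its cosine distance.  Returns s(t, d)
    for every similar word t, summing distances when several document
    words share the same similar word, per equation (1)."""
    s = {}
    for per_word in similar_lists:
        for t, dist in per_word.items():
            s[t] = s.get(t, 0.0) + dist
    return s

# Two document words; both list "birth" among their similar words.
lists = [{"birth": 0.8, "policy": 0.6},
         {"birth": 0.7, "income": 0.5}]
measure = cosine_measure(lists)  # s("birth", d) = 0.8 + 0.7 = 1.5
```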
Preferably, the specific process of calculating, in step S5, the weights in each document of the document's words and of the selected n similar words is as follows:
[Formula (2): equation image not reproduced in this text]
where w(t, nd) is the weight of word t in the document nd formed by adding the n neighbouring words, obtained by the TF-IDF feature-weighting method, and s(t, d) denotes the cosine measure of word t and document d.
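The image of formula (2) is not reproduced in this text, so the exact combination rule cannot be read off. The sketch below assumes one plausible reading, multiplying the TF-IDF weight w(t, nd) by the cosine measure s(t, d), and should be treated as an illustration only:

```python
def expanded_weight(tfidf_weight, cosine_measure_value):
    """Assumed combination for formula (2): the TF-IDF weight of word t
    in the expanded document nd multiplied by the cosine measure s(t, d).
    The true formula may differ, since its image is unavailable here."""
    return tfidf_weight * cosine_measure_value

w = expanded_weight(0.25, 1.5)  # 0.375 under this assumed rule
```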
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a word2vec-based short text representation method: word2vec is used to find the similar words of each word in a text, and the text's similar words are then obtained by calculation and used to expand the text's features in the vector space model.
(2) Experimental results show that in both the text clustering and the text classification stages of the experiment, the word2vec-based short text representation method clearly outperforms the traditional vector space model: the average DB_index of the clustering stage drops by 0.704, and the average classification accuracy of the classification stage rises by 4.614%. The method thus improves the clustering effect in both technique and application, and better extracts the knowledge topics in the corpus.
Drawings
FIG. 1 is the process of representing short text with the word2vec-based improved vector space model method
FIG. 2 is a line graph of DB_index versus feature dimension, for different cluster numbers, for text represented by the traditional vector space model method
FIG. 3 is a line graph of DB_index versus feature dimension, for different cluster numbers, for text represented by the method of the present invention
FIG. 4 is a histogram of clustered DB_index values for text represented by the traditional vector space model method and by the method of the present invention
FIG. 5 is a histogram of classification accuracy versus feature dimension for text represented by the traditional vector space model method and by the method described herein
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
The invention and its features and advantages are described in more detail below with reference to the accompanying drawings, taking a short text corpus on the universal two-child policy as an example.
The acquisition and preprocessing process of the training text set is as follows:
a comprehensive two-child policy short text corpus used for experiments is obtained by crawling a Xinlang microblog, and 102300 pieces of available data are obtained as an experiment corpus after necessary cleaning and filtering are carried out on the captured data. After Chinese segmentation and part-of-speech tagging of a text are completed by using a word segmentation system Java edition of NLPIR2016, a work-in-the-great stop word list is imported to remove stop words, and meanwhile, words without practical meanings such as pronouns, prepositions, azimuth words and the like are removed according to the part-of-speech. In the feature selection process, the unsupervised feature selection methods mainly included at present are TF, IDF and TF-IDF, and the TF-IDF method is selected for feature selection in the embodiment, so that the feature dimension is reduced.
(1) Short text representation process
As shown in FIG. 1, the process of representing a short text with the word2vec-based short text representation method comprises the following steps:
s1: before a word2 vec-based short text representation method is adopted, a Google word2vec open source tool is used for generating a word vector document for input data in a Linux environment. The text data after the text preprocessing of the training set is used as a data set for word2vec word vector generation, parameters are set to be-cbow 0-size 200-window 5-negative 0-hs 1-sample 1e-3-threads 12-binary 1, a Skip-Gram model is used, the size of a training window is 5, and a 200-dimensional word vector is generated.
S2: a series of similar words, over the whole training text set, of the words in a text is obtained from the word vectors by the word2vec method. Table 1 shows the words of one text together with their similar words and cosine distance values.
Table 1 Similar words and cosine distances of some words obtained by the word2vec method
[Table 1: image not reproduced in this text]
S3: the cosine distance between each document's similar words and the document is calculated according to formula (1).
S4: the number n of similar words of a document must be chosen with care. If n is too small, too few similar words take part in the calculation for each document after feature selection; if n is too large, the computation and running time of the text representation step increase greatly. The invention sets n to 50, i.e. the first 50 similar words of a document and their cosine distances are selected as the document's expanded features.
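The ranking and truncation of S4 can be sketched as follows (the cosine measures are illustrative values):

```python
def top_n_similar(measure, n=50):
    """Sort the similar words by their cosine measure s(t, d) in
    descending order and keep the first n with their values; the
    patent sets n = 50."""
    return sorted(measure.items(), key=lambda kv: kv[1], reverse=True)[:n]

measure = {"birth": 1.5, "policy": 0.6, "income": 0.5, "cost": 0.9}
top2 = top_n_similar(measure, n=2)  # [('birth', 1.5), ('cost', 0.9)]
```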
S5: the weights, in each document, of the document's words and of the selected n similar words are calculated according to formula (2) to form the new text representation, and the improved word2vec-based vector space representation of each document is output.
(2) Evaluation method
Using the K-means clustering method, DB_index is calculated for documents represented by the conventional method and by the word2vec-based short text representation method under different feature dimensions, and the cluster number is determined by seeking the minimum DB_index.
FIG. 2 shows a line graph of DB_index versus feature dimension for different cluster numbers under the traditional vector space model method; FIG. 3 shows the corresponding graph for the word2vec-based short text representation method.
As can be seen from FIGS. 2 and 3, whether the conventional vector space model method or the word2vec-based short text representation method is used, when the cluster number is 13 the intra-class dispersion and inter-class separation remain relatively stable across feature dimensions and DB_index attains its minimum value, so 13 is selected as the optimal cluster number.
(1) DB_index
DB_index = (1/k) · Σ_{i=1}^{k} max_{j≠i} [(S_i + S_j) / d_ij]
where k is the number of clusters, d_ij is the distance between the centers of classes i and j, and S_i is the average distance from the samples in class i to the center of that class.
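The cluster-number search can be sketched with scikit-learn, whose davies_bouldin_score implements the same index; the 2-D blobs below stand in for the document vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Two tight, well-separated blobs standing in for document vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(3.0, 0.1, (20, 2))])

# Try several cluster numbers and keep the one minimising DB_index,
# as the evaluation does when it settles on 13 clusters.
best_k, best_db = None, float("inf")
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db = davies_bouldin_score(X, labels)
    if db < best_db:
        best_k, best_db = k, db
```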
Because the optimal clustering effect is obtained with 13 clusters, the clustering effects of the two text representation methods are compared at that cluster number. FIG. 4 is a histogram of the DB_index values of the two text representation methods under different feature dimensions when the number of clusters is 13. As FIG. 4 shows, with 13 clusters and feature dimensions between 200 and 2000, the word2vec-based short text representation method obtains a lower DB_index than the conventional vector space model method. Representing texts with the word2vec-based method therefore represents them better: during clustering the texts attain a smaller intra-cluster dispersion and a larger inter-cluster separation.
(2) Interpretation of clustering results
Considering that DB_index attains its minimum value of 1.168 when the feature dimension of the word2vec-based short text representation method is 200, the clustering result at that setting is interpreted, as shown in Table 2.
As Table 2 shows, after the universal two-child policy took effect there are immediate livelihood problems to be faced and solved, including category 1 (education and medical care), category 4 (late marriage and late childbearing) and category 11 (female employment). These are the problems the public fed back directly and immediately after the policy took effect; they are negative effects of the policy and should prompt the relevant units to pay attention and take corresponding measures. Categories 2, 6 and 9 reflect the economic and living pressure the policy brings: people must first raise their personal income and quality of life before considering a second child, which presses the government to implement more welfare guarantees and to provide fuller employment and other means of raising incomes; otherwise, despite the liberalized policy, the real pressure of life will keep the overall fertility intention low and the aging of the population will not be relieved. Categories 3, 8 and 13 mainly concern the family problems a second child may bring; although these have no direct relation to policy enforcement, they are what every family worries about when deciding whether to respond to the policy, and families can be trusted to weigh them properly. Categories 5, 10 and 12 mainly reflect the public's opinions and feelings about the policy; most texts in these three categories express support and expectation, indicating that the policy meets a public demand.
Table 2 Feature words of each cluster and example texts within the classes
[Table 2: images not reproduced in this text]
The interpretation of each category in the clustering result shows that the categories formed by clustering short texts represented with the proposed method are well interpretable, and the knowledge topics within the clusters are easier to extract.
(3) Text classification accuracy using the clustering result as training corpus
The test set documents are manually classified according to the feature words and category interpretations, giving the test set its category labels. The manually labelled documents serve as the test set, and the documents whose classes were obtained by text clustering serve as the training set, so as to check the accuracy of the training corpus constructed automatically by clustering. Texts are represented with the traditional vector space model method and with the word2vec-based short text representation method respectively, using the same TF-IDF feature selection as in the clustering stage; the classification results of different classifiers under different feature dimensions are shown in Table 3.
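The classification check can be sketched with scikit-learn; the synthetic feature matrix below is a stand-in for the TF-IDF/word2vec-expanded document vectors, with cluster-derived training labels and manually annotated test labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Synthetic documents-as-vectors: rows would be document representations,
# columns feature dimensions, labels the cluster-derived categories.
X, y = make_classification(n_samples=400, n_features=50,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC(max_iter=5000).fit(X_tr, y_tr)  # SVM classifier, as in Table 3
acc = accuracy_score(y_te, clf.predict(X_te))   # classification accuracy
```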
Table 3 Accuracy of different classifiers versus feature dimension under the different text representation methods
[Table 3: image not reproduced in this text]
To compare the two text representation methods in text classification more intuitively, a histogram of classification accuracy versus feature dimension under the different representation methods can be drawn from Table 3, as shown in FIG. 5.
As can be seen from FIG. 5, under the word2vec-based short text representation method the training corpus constructed automatically by clustering achieves a classification accuracy above 80% at every feature dimension except 100 (where the feature dimensions may be too few to provide enough feature words for distinguishing the categories). Moreover, across the different feature dimensions and classifiers, the classification accuracy of the word2vec-based short text representation method is consistently higher than that of the traditional vector space model method; with the SVM classifier the improvement is only 2.38% at feature dimension 500 and ranges from 3.16% to 6.87% in all other cases. This shows that a corpus constructed by clustering texts represented with the proposed method better distinguishes the knowledge topics in the corpus and achieves better results in application.
It should be understood that the above-described embodiments are merely examples intended to illustrate the present invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should fall within the protection scope of the claims of the present invention.

Claims (3)

1. A word2vec-based short text representation method, characterized in that the method comprises the following steps:
S1: input a training text set that has undergone text preprocessing, set the word2vec parameters, and train to obtain the word vector set corresponding to the training text set;
S2: for each word in each document, calculate the cosine distance between word vectors to obtain a series of similar words of that word across the whole training text set;
S3: calculate the cosine distance between each document's similar words and the document;
S4: sort by cosine distance in descending order and select the first n similar words, which together with their cosine distances form the n similar words and cosine measures of the document;
S5: calculate the weights, in each document, of the document's own words and of the selected n similar words to form the new text representation, and output the improved word2vec-based vector space representation of each document;
the specific calculation process of step S3 is as follows:
If several words in a document share a consistent similar word, the cosine distances of that shared similar word are added together to form the cosine distance between the similar word and the document; otherwise, the original similar words and their cosine distances to the document's words are kept:
s(t,d)=s(t,t1)+s(t,t2)+s(t,t3)+…+s(t,tn) (1)
where t1, t2, t3, …, tn are the words in document d, s(t, tn) denotes the cosine distance between word t and word tn of document d, and s(t, d) denotes the cosine measure of word t and document d.
2. The word2vec-based short text representation method according to claim 1, characterized in that the preprocessing of the training text set in step S1 comprises:
S1.1: construct a user dictionary to perform word segmentation and part-of-speech tagging on the training text;
S1.2: remove stop words according to an existing stop word list, and remove pronouns, prepositions and locative words according to part of speech;
S1.3: perform feature selection with methods such as TF, IDF or TF-IDF, thereby reducing the feature dimension.
3. The word2vec-based short text representation method according to claim 2, characterized in that the specific process of calculating, in step S5, the weights in each document of the document's words and of the selected n similar words is as follows:
[Formula (2): equation image not reproduced in this text]
where w(t, nd) is the weight of word t in the document nd formed by adding the n neighbouring words, obtained by the TF-IDF feature-weighting method, and s(t, d) denotes the cosine measure of word t and document d.
CN201810525103.8A 2018-05-28 2018-05-28 Short text representation method based on word2vec Expired - Fee Related CN108804595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810525103.8A CN108804595B (en) 2018-05-28 2018-05-28 Short text representation method based on word2vec


Publications (2)

Publication Number Publication Date
CN108804595A CN108804595A (en) 2018-11-13
CN108804595B true CN108804595B (en) 2021-07-27

Family

ID=64090655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810525103.8A Expired - Fee Related CN108804595B (en) 2018-05-28 2018-05-28 Short text representation method based on word2vec

Country Status (1)

Country Link
CN (1) CN108804595B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162620B * 2019-01-10 2023-08-18 Tencent Technology (Shenzhen) Co., Ltd. Method and device for detecting black advertisements, server and storage medium
CN110232128A * 2019-06-21 2019-09-13 Central China Normal University Topic text classification method and device
CN110442873A * 2019-08-07 2019-11-12 Information Center of Yunnan Power Grid Co., Ltd. A hot-spot work order acquisition method and device based on the CBOW model
CN110705304B * 2019-08-09 2020-11-06 South China Normal University Attribute word extraction method
CN111177401A (en) * 2019-12-12 2020-05-19 西安交通大学 Power grid free text knowledge extraction method
CN112257431A (en) * 2020-10-30 2021-01-22 中电万维信息技术有限责任公司 NLP-based short text data processing method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN105279288A (en) * 2015-12-04 2016-01-27 深圳大学 Online content recommending method based on deep neural network
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN107590218A (en) * 2017-09-01 2018-01-16 南京理工大学 The efficient clustering method of multiple features combination Chinese text based on Spark


Non-Patent Citations (1)

Title
Word2vec-based document classification method; Chen Jie et al.; Computer Systems & Applications; 2017-11-15; pp. 159-164 *

Also Published As

Publication number Publication date
CN108804595A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108804595B (en) Short text representation method based on word2vec
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
Devika et al. Sentiment analysis: a comparative study on different approaches
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
CN109670014B (en) Paper author name disambiguation method based on rule matching and machine learning
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN110705247B Text similarity calculation method based on χ²-C
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN110046264A (en) A kind of automatic classification method towards mobile phone document
CN109492105A (en) A kind of text sentiment classification method based on multiple features integrated study
CN112434164A (en) Network public opinion analysis method and system considering topic discovery and emotion analysis
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN110929022A (en) Text abstract generation method and system
CN110674293B (en) Text classification method based on semantic migration
Liu et al. LIRIS-Imagine at ImageCLEF 2011 Photo Annotation Task.
Háva et al. Supervised two-step feature extraction for structured representation of text data
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN110580286A (en) Text feature selection method based on inter-class information entropy
Yan et al. Sentiment Analysis of Short Texts Based on Parallel DenseNet.
CN113076468A (en) Nested event extraction method based on domain pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210727