CN108804595B - Short text representation method based on word2vec - Google Patents

Short text representation method based on word2vec

Info

Publication number
CN108804595B
Authority
CN
China
Prior art keywords
document
words
word
similar words
cosine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810525103.8A
Other languages
Chinese (zh)
Other versions
CN108804595A (en
Inventor
路永和
张炜婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201810525103.8A priority Critical patent/CN108804595B/en
Publication of CN108804595A publication Critical patent/CN108804595A/en
Application granted granted Critical
Publication of CN108804595B publication Critical patent/CN108804595B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a word2vec-based short text representation method comprising the following steps. S1: input a training text set that has undergone text preprocessing, set the word2vec parameters, and train to obtain the word vector set corresponding to the training text set. S2: for each word in each document, calculate the cosine distance between word vectors to obtain a series of similar words of that word across the whole training text set. S3: calculate the cosine distance between each document's similar words and the document. S4: sort by cosine distance in descending order and select the first n similar words, which together with their cosine distances form the n similar words and cosine measures of the document. S5: calculate the weights, in each document, of the document's own words and of the selected n similar words to form the new text representation, and output the improved word2vec-based vector space representation of each document.

Description

Short text representation method based on word2vec
Technical Field
The invention relates to the field of computer science and technology, and in particular to a word2vec-based short text representation method.
Background
In text mining, before a machine can read sample information, the samples must first pass through a text representation step that converts them into numerical values. With the continual broadening of natural language processing and the development of computer technology, how to use numerical values to better represent the semantic information carried by texts has always been one of the crucial research questions in text processing, because it directly influences the text mining effect. For short text mining in particular, an effective text feature representation method is a difficult research point: short texts generated on social platforms not only suffer from the traditional problems of feature sparseness, incomplete semantics, polysemy (one word, many meanings) and synonymy (many words, one meaning), but are also casually phrased, full of misused neologisms, and produced in large volume.
Commonly used text representation models are the Boolean model, the probabilistic model and the vector space model, the most commonly used being the Vector Space Model (VSM) proposed by Gerard Salton et al. in 1958. The basic idea of the vector space model is to represent a text with a vector: partial feature words are selected from the training set, and each feature word serves as one dimension of a vector space coordinate system, so that every text becomes a vector in a multi-dimensional vector space. Each text is then a point in an n-dimensional space, and the similarity between texts can be measured by the angle between their vectors or the distance between them (Tai D Y, Wang. Text classification feature weight improvement algorithm [J]. Computer Engineering, 2010, 36(9): 197-). However, the vector space model has the defect that its data space is sparse and it ignores the semantic information between words, which makes its representation of short texts somewhat weak. Researchers have tried to remedy these defects. For example, Wang B K et al. proposed a strong feature thesaurus (SFT) built from latent Dirichlet allocation (LDA) and information gain (IG), which combines LDA and IG to increase the weight of vocabulary and thereby select feature words with stronger semantic information (Wang B K, Huang Y F, Yang W X, et al. Short text classification based on strong feature thesaurus [J]. Journal of Zhejiang University-SCIENCE C (Computers & Electronics), 2012, 13(9): 649-). Yang Lili et al. proposed a semantic extension method that combines the words and semantic features of short texts: Wikipedia is used as a background knowledge base to obtain the semantic features of words, and feature word weights are recalculated on the basis of the combination of words and semantics (Yang L, Li C, Ding Q, et al. Combining Lexical and Semantic Features for Short Text Classification [J]. Procedia Computer Science, 2013, 22(0): 78-86.).
In 2013, Google's Tomas Mikolov team released word2vec, an open-source word vector generation tool based on deep learning (Mikolov T, Le Q V, Sutskever I. Exploiting similarities among languages for machine translation [J]. arXiv preprint arXiv:1309.4168, 2013. Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space [J]. arXiv preprint arXiv:1301.3781, 2013.). The algorithm can learn high-quality word vectors from a large-scale corpus of real documents in a short time, making it convenient to compute the semantic similarity between words. word2vec can not only uncover semantic information among words but also offers a new solution to the sparseness of the vector space model on short texts.
Disclosure of Invention
The invention aims to provide a word2vec-based short text representation method that addresses the data-space sparseness and missing semantics of the vector space model (VSM); clustering short texts represented with this method allows knowledge topics to be extracted better.
In order to achieve this purpose, the technical scheme is as follows:
a short text representation method based on word2vec comprises the following steps:
S1: input a training text set that has undergone text preprocessing, set the word2vec parameters, and train to obtain the word vector set corresponding to the training text set;
S2: for each word in each document, calculate the cosine distance between word vectors to obtain a series of similar words of that word across the whole training text set;
S3: calculate the cosine distance between each document's similar words and the document;
S4: sort by cosine distance in descending order and select the first n similar words, which together with their cosine distances form the n similar words and cosine measures of the document;
S5: calculate the weights, in each document, of the document's own words and of the selected n similar words to form the new text representation, and output the improved word2vec-based vector space representation of each document.
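Steps S2 to S4 all rest on the cosine distance between word vectors. A minimal sketch of that measure, using toy NumPy vectors in place of trained word2vec vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; this is the
    quantity the method calls the 'cosine distance' between words."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for 200-dimensional word2vec output.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
sim = cosine_similarity(a, b)  # 0.5 for these toy vectors
```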
Preferably, the preprocessing of the training text set in step S1 comprises:
S1.1: construct a user dictionary to perform word segmentation and part-of-speech tagging on the training text;
S1.2: remove stop words according to an existing stop word list, and remove pronouns, prepositions and locative words according to part of speech;
S1.3: perform feature selection with methods such as TF, IDF or TF-IDF, thereby reducing the feature dimension.
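The preprocessing pipeline of S1.1–S1.3 can be sketched as follows; the user dictionary, segmentation system and stop-word list are replaced here by toy English stand-ins, and the TF-IDF scoring is a plain textbook formulation rather than necessarily the exact variant used by the method:

```python
import math

# Toy pre-tokenized corpus; in the patent, NLPIR2016 performs the Chinese
# segmentation and a stop-word list plus POS filtering is applied first.
docs = [["policy", "child", "family"],
        ["policy", "income", "family"],
        ["child", "education", "cost"]]
stopwords = {"the", "of"}  # stand-in for a real stop-word list

docs = [[w for w in d if w not in stopwords] for d in docs]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF: term frequency times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# Keep the top-k terms by their best TF-IDF score across all documents,
# reducing the feature dimension as in S1.3.
scores = {}
for d in docs:
    for t in set(d):
        scores[t] = max(scores.get(t, 0.0), tf_idf(t, d, docs))
k = 4
features = sorted(scores, key=scores.get, reverse=True)[:k]
```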
Preferably, the specific calculation process of step S3 is as follows:
If several words in a document share a consistent similar word, the cosine distances of that shared similar word are added together to form the cosine distance between the similar word and the document; otherwise, the original similar words and their cosine distances to the document's words are kept:
s(t,d)=s(t,t1)+s(t,t2)+s(t,t3)+…+s(t,tn) (1)
where t1, t2, t3, …, tn are the words in document d, s(t, tn) denotes the cosine distance between word t and word tn of document d, and s(t, d) denotes the cosine measure of word t and document d.
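A sketch of equation (1): for each similar word t, the cosine distances contributed by the document's words are summed when they coincide, otherwise kept as-is. The dictionaries of similar words below are illustrative stand-ins for word2vec output:

```python
def cosine_measure(similar_lists):
    """similar_lists: one dict per word of the document, mapping each of
    that word's similar words to its cosine distance.  Returns s(t, d)
    for every similar word t, summing distances when several document
    words share the same similar word, per equation (1)."""
    s = {}
    for per_word in similar_lists:
        for t, dist in per_word.items():
            s[t] = s.get(t, 0.0) + dist
    return s

# Two document words; both list "birth" among their similar words.
lists = [{"birth": 0.8, "policy": 0.6},
         {"birth": 0.7, "income": 0.5}]
measure = cosine_measure(lists)  # s("birth", d) = 0.8 + 0.7 = 1.5
```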
Preferably, the specific process of calculating, in step S5, the weights in each document of the document's words and of the selected n similar words is as follows:
[Formula (2): equation image not reproduced in this text]
where w(t, nd) is the weight of word t in the document nd formed by adding the n neighbouring words, obtained by the TF-IDF feature-weighting method, and s(t, d) denotes the cosine measure of word t and document d.
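The image of formula (2) is not reproduced in this text, so the exact combination rule cannot be read off. The sketch below assumes one plausible reading, multiplying the TF-IDF weight w(t, nd) by the cosine measure s(t, d), and should be treated as an illustration only:

```python
def expanded_weight(tfidf_weight, cosine_measure_value):
    """Assumed combination for formula (2): the TF-IDF weight of word t
    in the expanded document nd multiplied by the cosine measure s(t, d).
    The true formula may differ, since its image is unavailable here."""
    return tfidf_weight * cosine_measure_value

w = expanded_weight(0.25, 1.5)  # 0.375 under this assumed rule
```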
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a word2vec-based short text representation method: word2vec is used to find the similar words of each word in a text, and the text's similar words are then obtained by calculation and used to expand the text's features in the vector space model.
(2) Experimental results show that in both the text clustering and the text classification stages of the experiment, the word2vec-based short text representation method clearly outperforms the traditional vector space model: the average DB_index of the clustering stage drops by 0.704, and the average classification accuracy of the classification stage rises by 4.614%. The method thus improves the clustering effect in both technique and application, and better extracts the knowledge topics in the corpus.
Drawings
FIG. 1 is the process of representing short text with the word2vec-based improved vector space model method
FIG. 2 is a line graph of DB_index versus feature dimension, for different cluster numbers, for text represented by the traditional vector space model method
FIG. 3 is a line graph of DB_index versus feature dimension, for different cluster numbers, for text represented by the method of the present invention
FIG. 4 is a histogram of clustered DB_index values for text represented by the traditional vector space model method and by the method of the present invention
FIG. 5 is a histogram of classification accuracy versus feature dimension for text represented by the traditional vector space model method and by the method described herein
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
The invention and its features and advantages are described in more detail below with reference to the accompanying drawings, taking a short text corpus on the universal two-child policy as an example.
The acquisition and preprocessing process of the training text set is as follows:
a comprehensive two-child policy short text corpus used for experiments is obtained by crawling a Xinlang microblog, and 102300 pieces of available data are obtained as an experiment corpus after necessary cleaning and filtering are carried out on the captured data. After Chinese segmentation and part-of-speech tagging of a text are completed by using a word segmentation system Java edition of NLPIR2016, a work-in-the-great stop word list is imported to remove stop words, and meanwhile, words without practical meanings such as pronouns, prepositions, azimuth words and the like are removed according to the part-of-speech. In the feature selection process, the unsupervised feature selection methods mainly included at present are TF, IDF and TF-IDF, and the TF-IDF method is selected for feature selection in the embodiment, so that the feature dimension is reduced.
(1) Short text representation process
As shown in FIG. 1, the process of representing a short text with the word2vec-based short text representation method comprises the following steps:
s1: before a word2 vec-based short text representation method is adopted, a Google word2vec open source tool is used for generating a word vector document for input data in a Linux environment. The text data after the text preprocessing of the training set is used as a data set for word2vec word vector generation, parameters are set to be-cbow 0-size 200-window 5-negative 0-hs 1-sample 1e-3-threads 12-binary 1, a Skip-Gram model is used, the size of a training window is 5, and a 200-dimensional word vector is generated.
S2: a series of similar words, over the whole training text set, of the words in a text is obtained from the word vectors by the word2vec method. Table 1 shows the words of one text together with their similar words and cosine distance values.
Table 1 Similar words and cosine distances of some words obtained by the word2vec method
[Table 1: image not reproduced in this text]
S3: the cosine distance between each document's similar words and the document is calculated according to formula (1).
S4: the number n of similar words of a document must be chosen with care. If n is too small, too few similar words take part in the calculation for each document after feature selection; if n is too large, the computation and running time of the text representation step increase greatly. The invention sets n to 50, i.e. the first 50 similar words of a document and their cosine distances are selected as the document's expanded features.
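The ranking and truncation of S4 can be sketched as follows (the cosine measures are illustrative values):

```python
def top_n_similar(measure, n=50):
    """Sort the similar words by their cosine measure s(t, d) in
    descending order and keep the first n with their values; the
    patent sets n = 50."""
    return sorted(measure.items(), key=lambda kv: kv[1], reverse=True)[:n]

measure = {"birth": 1.5, "policy": 0.6, "income": 0.5, "cost": 0.9}
top2 = top_n_similar(measure, n=2)  # [('birth', 1.5), ('cost', 0.9)]
```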
S5: the weights, in each document, of the document's words and of the selected n similar words are calculated according to formula (2) to form the new text representation, and the improved word2vec-based vector space representation of each document is output.
(2) Evaluation method
Using the K-means clustering method, DB_index is calculated for documents represented by the conventional method and by the word2vec-based short text representation method under different feature dimensions, and the cluster number is determined by seeking the minimum DB_index.
FIG. 2 shows a line graph of DB_index versus feature dimension for different cluster numbers under the traditional vector space model method; FIG. 3 shows the corresponding graph for the word2vec-based short text representation method.
As can be seen from FIGS. 2 and 3, whether the conventional vector space model method or the word2vec-based short text representation method is used, when the cluster number is 13 the intra-class dispersion and inter-class separation remain relatively stable across feature dimensions and DB_index attains its minimum value, so 13 is selected as the optimal cluster number.
(1) DB_index
DB_index = (1/k) · Σ_{i=1}^{k} max_{j≠i} [(S_i + S_j) / d_ij]
where k is the number of clusters, d_ij is the distance between the centers of classes i and j, and S_i is the average distance from the samples in class i to the center of that class.
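The cluster-number search can be sketched with scikit-learn, whose davies_bouldin_score implements the same index; the 2-D blobs below stand in for the document vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Two tight, well-separated blobs standing in for document vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(3.0, 0.1, (20, 2))])

# Try several cluster numbers and keep the one minimising DB_index,
# as the evaluation does when it settles on 13 clusters.
best_k, best_db = None, float("inf")
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db = davies_bouldin_score(X, labels)
    if db < best_db:
        best_k, best_db = k, db
```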
Because the optimal clustering effect is obtained with 13 clusters, the clustering effects of the two text representation methods are compared at that cluster number. FIG. 4 is a histogram of the DB_index values of the two text representation methods under different feature dimensions when the number of clusters is 13. As FIG. 4 shows, with 13 clusters and feature dimensions between 200 and 2000, the word2vec-based short text representation method obtains a lower DB_index than the conventional vector space model method. Representing texts with the word2vec-based method therefore represents them better: during clustering the texts attain a smaller intra-cluster dispersion and a larger inter-cluster separation.
(2) Interpretation of clustering results
Considering that DB_index attains its minimum value of 1.168 when the feature dimension of the word2vec-based short text representation method is 200, the clustering result at that setting is interpreted, as shown in Table 2.
As Table 2 shows, after the universal two-child policy took effect there are immediate livelihood problems to be faced and solved, including category 1 (education and medical care), category 4 (late marriage and late childbearing) and category 11 (female employment). These are the problems the public fed back directly and immediately after the policy took effect; they are negative effects of the policy and should prompt the relevant units to pay attention and take corresponding measures. Categories 2, 6 and 9 reflect the economic and living pressure the policy brings: people must first raise their personal income and quality of life before considering a second child, which presses the government to implement more welfare guarantees and to provide fuller employment and other means of raising incomes; otherwise, despite the liberalized policy, the real pressure of life will keep the overall fertility intention low and the aging of the population will not be relieved. Categories 3, 8 and 13 mainly concern the family problems a second child may bring; although these have no direct relation to policy enforcement, they are what every family worries about when deciding whether to respond to the policy, and families can be trusted to weigh them properly. Categories 5, 10 and 12 mainly reflect the public's opinions and feelings about the policy; most texts in these three categories express support and expectation, indicating that the policy meets a public demand.
Table 2 Feature words of each cluster and example texts within the classes
[Table 2: images not reproduced in this text]
The interpretation of each category in the clustering result shows that the categories formed by clustering short texts represented with the proposed method are well interpretable, and the knowledge topics within the clusters are easier to extract.
(3) Text classification accuracy using the clustering result as training corpus
The test set documents are manually classified according to the feature words and category interpretations, giving the test set its category labels. The manually labelled documents serve as the test set, and the documents whose classes were obtained by text clustering serve as the training set, so as to check the accuracy of the training corpus constructed automatically by clustering. Texts are represented with the traditional vector space model method and with the word2vec-based short text representation method respectively, using the same TF-IDF feature selection as in the clustering stage; the classification results of different classifiers under different feature dimensions are shown in Table 3.
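The classification check can be sketched with scikit-learn; the synthetic feature matrix below is a stand-in for the TF-IDF/word2vec-expanded document vectors, with cluster-derived training labels and manually annotated test labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Synthetic documents-as-vectors: rows would be document representations,
# columns feature dimensions, labels the cluster-derived categories.
X, y = make_classification(n_samples=400, n_features=50,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC(max_iter=5000).fit(X_tr, y_tr)  # SVM classifier, as in Table 3
acc = accuracy_score(y_te, clf.predict(X_te))   # classification accuracy
```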
Table 3 Accuracy of different classifiers versus feature dimension under the different text representation methods
[Table 3: image not reproduced in this text]
To compare the two text representation methods in text classification more intuitively, a histogram of classification accuracy versus feature dimension under the different representation methods can be drawn from Table 3, as shown in FIG. 5.
As can be seen from FIG. 5, under the word2vec-based short text representation method the training corpus constructed automatically by clustering achieves a classification accuracy above 80% at every feature dimension except 100 (where the feature dimensions may be too few to provide enough feature words for distinguishing the categories). Moreover, across the different feature dimensions and classifiers, the classification accuracy of the word2vec-based short text representation method is consistently higher than that of the traditional vector space model method; with the SVM classifier the improvement is only 2.38% at feature dimension 500 and ranges from 3.16% to 6.87% in all other cases. This shows that a corpus constructed by clustering texts represented with the proposed method better distinguishes the knowledge topics in the corpus and achieves better results in application.
It should be understood that the above-described embodiments are merely examples intended to illustrate the present invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should fall within the protection scope of the claims of the present invention.

Claims (3)

1. A word2vec-based short text representation method, characterized in that the method comprises the following steps:
S1: input a training text set that has undergone text preprocessing, set the word2vec parameters, and train to obtain the word vector set corresponding to the training text set;
S2: for each word in each document, calculate the cosine distance between word vectors to obtain a series of similar words of that word across the whole training text set;
S3: calculate the cosine distance between each document's similar words and the document;
S4: sort by cosine distance in descending order and select the first n similar words, which together with their cosine distances form the n similar words and cosine measures of the document;
S5: calculate the weights, in each document, of the document's own words and of the selected n similar words to form the new text representation, and output the improved word2vec-based vector space representation of each document;
the specific calculation process of step S3 is as follows:
If several words in a document share a consistent similar word, the cosine distances of that shared similar word are added together to form the cosine distance between the similar word and the document; otherwise, the original similar words and their cosine distances to the document's words are kept:
s(t,d)=s(t,t1)+s(t,t2)+s(t,t3)+…+s(t,tn) (1)
where t1, t2, t3, …, tn are the words in document d, s(t, tn) denotes the cosine distance between word t and word tn of document d, and s(t, d) denotes the cosine measure of word t and document d.
2. The word2vec-based short text representation method according to claim 1, characterized in that the preprocessing of the training text set in step S1 comprises:
S1.1: construct a user dictionary to perform word segmentation and part-of-speech tagging on the training text;
S1.2: remove stop words according to an existing stop word list, and remove pronouns, prepositions and locative words according to part of speech;
S1.3: perform feature selection with methods such as TF, IDF or TF-IDF, thereby reducing the feature dimension.
3. The word2vec-based short text representation method according to claim 2, characterized in that the specific process of calculating, in step S5, the weights in each document of the document's words and of the selected n similar words is as follows:
[Formula (2): equation image not reproduced in this text]
where w(t, nd) is the weight of word t in the document nd formed by adding the n neighbouring words, obtained by the TF-IDF feature-weighting method, and s(t, d) denotes the cosine measure of word t and document d.
CN201810525103.8A 2018-05-28 2018-05-28 Short text representation method based on word2vec Expired - Fee Related CN108804595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810525103.8A CN108804595B (en) 2018-05-28 2018-05-28 Short text representation method based on word2vec


Publications (2)

Publication Number Publication Date
CN108804595A CN108804595A (en) 2018-11-13
CN108804595B true CN108804595B (en) 2021-07-27

Family

ID=64090655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810525103.8A Expired - Fee Related CN108804595B (en) 2018-05-28 2018-05-28 Short text representation method based on word2vec

Country Status (1)

Country Link
CN (1) CN108804595B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162620B * 2019-01-10 2023-08-18 Tencent Technology (Shenzhen) Co., Ltd. Method and device for detecting black advertisements, server and storage medium
CN110232128A * 2019-06-21 2019-09-13 Central China Normal University Topic text classification method and device
CN110442873A * 2019-08-07 2019-11-12 Information Center of Yunnan Power Grid Co., Ltd. A hot-spot work order acquisition method and device based on the CBOW model
CN110705304B * 2019-08-09 2020-11-06 South China Normal University Attribute word extraction method
CN111177401A (en) * 2019-12-12 2020-05-19 西安交通大学 Power grid free text knowledge extraction method
CN112257431A (en) * 2020-10-30 2021-01-22 中电万维信息技术有限责任公司 NLP-based short text data processing method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN105279288A (en) * 2015-12-04 2016-01-27 深圳大学 Online content recommending method based on deep neural network
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN107590218A (en) * 2017-09-01 2018-01-16 南京理工大学 The efficient clustering method of multiple features combination Chinese text based on Spark


Non-Patent Citations (1)

Title
Word2vec-based document classification method; Chen Jie et al.; Computer Systems & Applications; 2017-11-15; pp. 159-164 *

Also Published As

Publication number Publication date
CN108804595A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108804595B (en) Short text representation method based on word2vec
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
Devika et al. Sentiment analysis: a comparative study on different approaches
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
CN109670014B (en) Paper author name disambiguation method based on rule matching and machine learning
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN110705247B Text similarity calculation method based on χ²-C
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN110046264A (en) A kind of automatic classification method towards mobile phone document
CN109492105A (en) A kind of text sentiment classification method based on multiple features integrated study
CN112434164A (en) Network public opinion analysis method and system considering topic discovery and emotion analysis
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN110929022A (en) Text abstract generation method and system
CN110674293B (en) Text classification method based on semantic migration
Liu et al. LIRIS-Imagine at ImageCLEF 2011 Photo Annotation Task.
Háva et al. Supervised two-step feature extraction for structured representation of text data
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN110580286A (en) Text feature selection method based on inter-class information entropy
Yan et al. Sentiment Analysis of Short Texts Based on Parallel DenseNet.
CN113076468A (en) Nested event extraction method based on domain pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210727