CN109885680B - Short text classification preprocessing method, system and device based on semantic extension

Info

Publication number: CN109885680B
Application number: CN201910060245.6A
Authority: CN
Inventors: 郑建华; 刘双印; 朱蓉; 贺超波; 徐龙琴; 张世龙; 冯大春
Original assignee: Zhongkai University of Agriculture and Engineering
Current assignee: Zhongkai University of Agriculture and Engineering
Priority date: 2019-01-22
Filing date: 2019-01-22
Publication date: 2020-05-19
Anticipated expiration: 2039-01-22
Also published as: CN109885680A

Abstract

The invention discloses a short text classification preprocessing method, system and device based on semantic extension. The method comprises the following steps: performing preliminary processing on a short text to be classified to obtain an original word vector; performing semantic expansion processing on each word in the original word vector to obtain expanded word vectors, which form a candidate expanded word vector set; calculating the semantic similarity of the expanded word vectors in the candidate expanded word vector set, and screening out the group of expanded word vectors with the maximum semantic similarity as the specific word vectors; and weighting the original word vector and the specific word vector group to obtain the word vector to be classified. The invention overcomes the insufficient information content of the original text. Because it works by semantic expansion, it does not restrict the choice of the later-stage classification algorithm, and it recognizes newly appearing words better, which helps the generalization performance of the subsequent classification algorithm and greatly improves the accuracy of the subsequent classification.

Description

Short text classification preprocessing method, system and device based on semantic extension

Technical Field

The invention relates to the technical field of data processing, in particular to a short text classification preprocessing method, system and device based on semantic extension.

Background

Text classification is a widely encountered application scenario: for example, news items need to be classified into sports, politics and the like, and novels need to be classified into genres such as science fiction, story and swordsman fiction. Current text classification methods are mainly based either on traditional feature engineering combined with a machine learning algorithm or on the direct use of a deep learning algorithm. However, a long text provides a large amount of information while a short text provides very little, so characteristic information is easier to extract from long texts and harder to extract from short texts.

For short text classification, existing methods focus on which classification algorithm to adopt in order to improve classification accuracy, such as convolutional neural networks, multi-model fusion, SVM and random forest. In practice, however, the difficulty of short text classification is that the text is too short and contains too little information, so that too few features are input to the classification algorithm and the classification accuracy is low.

Disclosure of Invention

In order to solve the above technical problems, the present invention provides a method, a system and a device for preprocessing short text classification based on semantic extension, which can improve accuracy.

The technical scheme adopted by the invention is as follows:

a short text classification preprocessing method based on semantic extension comprises the following steps:

performing primary processing on short texts to be classified to obtain original word vectors;

performing semantic expansion processing on each word in the original word vector to obtain an expanded word vector, and further forming a candidate expanded word vector set;

performing semantic similarity calculation on the expansion word vectors in the candidate expansion word vector set, and screening to obtain a group of expansion word vectors with the maximum semantic similarity as specific word vectors;

weighting the original word vector and the specific word vector group to obtain word vectors to be classified;

and inputting the word vector to be classified into a classifier for text classification.

As a further improvement of the short text classification preprocessing method based on the semantic extension, the step of performing preliminary processing on the short text to be classified to obtain an original word vector specifically includes:

performing word segmentation processing on short texts to be classified to obtain word segmentation results;

and performing stop word deletion processing on the word segmentation result to obtain an original word vector.

As a further improvement of the short text classification preprocessing method based on the semantic extension, the step of performing semantic expansion processing on each word in the original word vector to obtain an expanded word vector, and further forming a candidate expanded word vector set, specifically includes:

performing semantic expansion processing on each word in the original word vector to obtain a sememe set corresponding to each word;

extracting sememes from the sememe set corresponding to each word according to a preset mode to form an expanded word vector;

and forming a candidate expansion word vector set according to the obtained expansion word vectors.

As a further improvement of the short text classification preprocessing method based on the semantic extension, the step of performing semantic similarity calculation on the expansion word vectors in the candidate expansion word vector set, and screening out the group of expansion word vectors with the largest average semantic similarity as the specific word vectors, specifically includes:

vectorizing expansion word vectors in the candidate expansion word vector set to obtain a word vector feature set corresponding to the expansion word vectors;

calculating the semantic similarity of any two word vector representations according to the word vector representation set corresponding to the expanded word vectors;

calculating the average semantic similarity of a word vector representation set corresponding to the expansion word vector according to the semantic similarity represented by any two word vectors;

and screening the group of expansion word vectors with the maximum average semantic similarity as the specific word vectors according to the average similarity of the word vector representation set corresponding to each expansion word vector.

The other technical scheme adopted by the invention is as follows:

a semantic extension-based short text classification preprocessing system, comprising:

the preliminary processing unit is used for carrying out preliminary processing on the short texts to be classified to obtain original word vectors;

the semantic expansion unit is used for performing semantic expansion processing on each word in the original word vector to obtain an expanded word vector and further form a candidate expanded word vector set;

the screening unit is used for carrying out semantic similarity calculation on the expansion word vectors in the candidate expansion word vector set, and screening to obtain a group of expansion word vectors with the maximum semantic similarity as specific word vectors;

the weighting processing unit is used for weighting the original word vector and the specific word vector group to obtain a word vector to be classified;

and the input unit is used for inputting the word vectors to be classified into the classifier to classify the texts.

As a further improvement of the short text classification preprocessing system based on the semantic extension, the preliminary processing unit specifically includes:

the word segmentation processing unit is used for carrying out word segmentation processing on the short text to be classified to obtain a word segmentation result;

and the stop word processing unit is used for deleting the stop words from the word segmentation result to obtain an original word vector.

As a further improvement of the short text classification preprocessing system based on the semantic extension, the semantic extension unit specifically includes:

the expansion unit is used for carrying out semantic expansion processing on each word in the original word vector to obtain a sememe set corresponding to each word;

the extraction unit is used for extracting the sememes from the sememe set corresponding to each word according to a preset mode to form an expanded word vector;

and the set forming unit is used for forming a candidate expansion word vector set according to the obtained expansion word vectors.

As a further improvement of the short text classification preprocessing system based on the semantic extension, the screening unit specifically includes:

the vectorization processing unit is used for vectorizing and characterizing the expansion word vectors in the candidate expansion word vector set to obtain a word vector feature set corresponding to the expansion word vectors;

the semantic similarity calculation unit is used for calculating the semantic similarity of any two word vector representations according to the word vector representation set corresponding to the expanded word vector;

the average calculating unit is used for calculating the average semantic similarity of the word vector representation set corresponding to the expansion word vector according to the semantic similarity represented by any two word vectors;

and the word vector screening unit is used for screening the group of expansion word vectors with the maximum average semantic similarity as the specific word vector according to the average similarity of the word vector expression set corresponding to each expansion word vector.

The invention adopts another technical scheme that:

an apparatus for preprocessing short text classification based on semantic extension, comprising:

a memory for storing a program;

a processor for executing the program, the program causing the processor to execute the method for preprocessing short text classification based on the semantic extension.

The invention has the beneficial effects that:

With the short text classification preprocessing method, system and device based on semantic extension of the invention, the word vector to be classified, obtained through semantic extension, semantic similarity calculation and weighting, replaces the original short text in the classification algorithm, thereby overcoming the insufficient information content of the original text.

Drawings

FIG. 1 is a flowchart illustrating steps of a short text classification preprocessing method based on semantic extension according to the present invention;

FIG. 2 is a block diagram of a short text classification preprocessing system based on semantic extension according to the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the accompanying drawings:

Referring to FIG. 1, the invention relates to a short text classification preprocessing method based on semantic extension, comprising the following steps:

S1, performing preliminary processing on the short text to be classified to obtain an original word vector;

S2, performing semantic expansion processing on each word in the original word vector to obtain an expanded word vector, and further forming a candidate expanded word vector set;

S3, performing semantic similarity calculation on the expansion word vectors in the candidate expansion word vector set, and screening to obtain a group of expansion word vectors with the maximum semantic similarity as specific word vectors;

S4, weighting the original word vector and the specific word vector group to obtain a word vector to be classified;

S5, inputting the word vector to be classified into a classifier for text classification.

In the present embodiment, it is assumed that, for a short text T_i, two groups of expansion word vectors p and q are selected.

A new text, namely the word vector to be classified, is therefore formed by concatenating the three parts (the original word vector and the two selected expansion word vectors); it replaces the original short text as the input to the classification algorithm.

To strengthen the contribution of the original short text T_i, the invention further proposes to apply a weight w to the original short text T_i in the new replacement text, wherein w > 1.

Steps S1 to S4 are performed on all short texts to form a new replacement text data set, which is then passed to any existing text classification algorithm to obtain the classification result of each short text.
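The weighting in S4 can be sketched as follows. This is only one possible reading, not a formula confirmed by the text: it assumes that the weight w is an integer and that weighting means repeating the original words w times before appending the selected expansion word vectors. The function name `build_replacement_text` and the sample words are hypothetical.

```python
from typing import List, Sequence


def build_replacement_text(
    original: Sequence[str],
    expansions: Sequence[Sequence[str]],
    w: int = 2,
) -> List[str]:
    """Concatenate w copies of the original words with the selected expansions.

    Assumption: weighting is realized by repetition, with an integer w > 1.
    """
    if w <= 1:
        raise ValueError("the description requires w > 1")
    replacement = list(original) * w  # emphasize the original short text
    for expansion in expansions:      # append each selected expansion vector
        replacement.extend(expansion)
    return replacement


t_i = ["football", "match"]    # original word vector (toy data)
p = ["sport", "compete"]       # first selected expansion word vector
q = ["exercise", "event"]      # second selected expansion word vector
new_text = build_replacement_text(t_i, [p, q], w=2)
```

With a bag-of-words classifier downstream, repeating the original words raises their term frequency relative to the expansion words, which is one simple way to give the original text more influence.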

Further as a preferred embodiment, the step of performing preliminary processing on the short text to be classified to obtain an original word vector specifically includes:

S11, performing word segmentation processing on the short text to be classified to obtain a word segmentation result;

S12, performing stop word deletion processing on the word segmentation result to obtain an original word vector.

In this embodiment, each short text may be segmented using any word segmentation tool (e.g., jieba), and preset stop words, such as the Chinese particles "的", "地" and "得", are then deleted from the result, giving the original word vector T_i.

Here T_i denotes the word vector of the i-th short text; its first element is the first word in the text and its last element is the C_i-th word in the short text, where C_i is also the number of words obtained from the short text.
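Steps S11 and S12 can be sketched as below. This is a minimal illustration under stated assumptions rather than the patented implementation: `segment` is a hypothetical stand-in that simply splits on whitespace (a real system would call a segmentation tool such as jieba), and `STOP_WORDS` is a three-entry illustrative list.

```python
from typing import Callable, Iterable, List

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = frozenset({"的", "地", "得"})


def segment(text: str) -> List[str]:
    """Hypothetical stand-in for a word segmenter: split on whitespace."""
    return text.split()


def original_word_vector(
    text: str,
    segmenter: Callable[[str], List[str]] = segment,
    stop_words: Iterable[str] = STOP_WORDS,
) -> List[str]:
    """S11: segment the short text. S12: drop stop words, keeping word order."""
    stop = set(stop_words)
    return [word for word in segmenter(text) if word not in stop]


# The sample sentence is pre-separated by spaces so the stand-in segmenter works.
t_i = original_word_vector("今天 的 比赛 很 精彩")
c_i = len(t_i)  # C_i: number of words in the original word vector T_i
```

Passing the segmenter as a parameter keeps the sketch independent of any particular tool, in line with the statement that any segmentation tool may be used.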

Further as a preferred embodiment, the step of performing semantic expansion processing on each word in the original word vector to obtain an expanded word vector, and further forming a candidate expanded word vector set, specifically includes:

S21, performing semantic expansion processing on each word in the original word vector to obtain a sememe set corresponding to each word;

Each word in the original word vector T_i of a short text corresponds to a concept in HowNet, so the invention extends the representation of the short text on the basis of the sememes that HowNet assigns to each concept.

Consider the j-th word of short text i. In HowNet's sememe-based description of concepts, this word has a corresponding sememe set, which contains CS_j sememes in total, beginning with the first sememe of the set.

Similarly, a corresponding sememe set can be constructed for each of the other words of short text i, so that a sequence of sememe sets, T_Sem_i, is obtained for the original word vector T_i of each short text.

S22, extracting sememes from the sememe set corresponding to each word according to a preset mode to form an expanded word vector;

The preset mode for constructing a group of expanded word vectors in this embodiment is as follows: one or two sememes are extracted from each sememe set in the sememe set sequence T_Sem_i corresponding to the original word vector T_i, and the extracted sememes are combined to form an expanded word vector. For example, the first item of each sememe set may be extracted to form one expanded word vector.

For an original word vector T_i there are multiple such expanded word vectors, and together they form the set of expanded word vectors.

S23, forming a candidate expansion word vector set according to the obtained expansion word vectors.
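Steps S21 to S23 can be sketched as follows for the simplest preset mode, in which exactly one sememe is taken from each word's sememe set. The lexicon `TOY_SEMEMES` is hypothetical toy data, not real HowNet content, and the fallback for words missing from the lexicon (using the word itself) is an assumption of this sketch.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Toy sememe lexicon (hypothetical entries, not taken from HowNet).
TOY_SEMEMES: Dict[str, List[str]] = {
    "football": ["sport", "ball"],
    "match": ["compete", "event", "fact"],
}


def sememe_set_sequence(
    words: Sequence[str], lexicon: Dict[str, List[str]]
) -> List[List[str]]:
    """S21: look up the sememe set of each word (the sequence T_Sem_i).

    Assumption: a word missing from the lexicon is represented by itself.
    """
    return [lexicon.get(word, [word]) for word in words]


def candidate_expansions(
    sememe_sets: Sequence[Sequence[str]],
) -> List[Tuple[str, ...]]:
    """S22-S23: pick one sememe per word, in every possible combination."""
    return [tuple(choice) for choice in product(*sememe_sets)]


t_sem_i = sememe_set_sequence(["football", "match"], TOY_SEMEMES)
candidates = candidate_expansions(t_sem_i)
# The first candidate takes the first item of every sememe set.
first_candidate = candidates[0]
```

Under this one-sememe-per-word mode, the number of candidates is the product of the sememe set sizes (here 2 × 3 = 6), so the candidate set grows quickly with text length; a practical system might cap or sample it.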

Further as a preferred embodiment, the step of performing semantic similarity calculation on the expansion word vectors in the candidate expansion word vector set, and screening out the group of expansion word vectors with the largest average semantic similarity as the specific word vectors, specifically includes:

S31, vectorizing the expansion word vectors in the candidate expansion word vector set to obtain a word vector feature set corresponding to the expansion word vectors;

This embodiment of the invention adopts the word2vec technique, using Wikipedia or the Sogou corpus as the training corpus, to represent each sememe word in an expanded word vector as a vector. The vector dimension can be set to 50, 100, 300 or another value, and each value in the vector is a floating-point number, which completes the vectorized representation of each word.

For example, each sememe in an expanded word vector is replaced by its word2vec vector, which gives the word2vec vector representation set corresponding to that candidate expansion word vector.
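The vectorization in S31 can be illustrated with a plain lookup table. The sketch below assumes a hypothetical, hand-written embedding table `TOY_EMBEDDINGS` with three dimensions; real word2vec vectors would be trained on a corpus and loaded from a model. Skipping sememes that have no vector is also an assumption of this sketch.

```python
from typing import Dict, List, Sequence

# Toy 3-dimensional embeddings (hypothetical values; trained word2vec vectors
# would typically have 50, 100 or 300 floating-point components).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "sport": [0.9, 0.1, 0.0],
    "compete": [0.8, 0.2, 0.1],
    "ball": [0.1, 0.9, 0.2],
}


def vectorize(
    expansion: Sequence[str], embeddings: Dict[str, List[float]]
) -> List[List[float]]:
    """S31: replace each sememe of an expanded word vector with its vector.

    Assumption: sememes without an embedding are skipped.
    """
    return [embeddings[sememe] for sememe in expansion if sememe in embeddings]


vector_set = vectorize(("sport", "compete", "unknown"), TOY_EMBEDDINGS)
```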

S32, calculating the semantic similarity of any two word vector representations according to the word vector representation set corresponding to the expanded word vector;

The invention is not limited to a particular similarity calculation method; cosine similarity is taken only as an example in this embodiment. Given two vectors A and B of dimension n, their cosine similarity is calculated as:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{k=1}^{n} A_k B_k}{\sqrt{\sum_{k=1}^{n} A_k^2}\,\sqrt{\sum_{k=1}^{n} B_k^2}}$$

Using a similarity calculation method (such as the cosine similarity formula above), the invention calculates the semantic similarity of any two word vector representations.
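Cosine similarity, the example measure named in this embodiment, can be computed as in the sketch below. Returning 0.0 when either vector has zero length is a convention chosen for this sketch; the measure is mathematically undefined in that case.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # sketch convention for zero vectors
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [1.0, 0.0])       # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # independent
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
```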

S33, calculating the average semantic similarity of the word vector representation set corresponding to the expansion word vector according to the semantic similarity of any two word vector representations;

Each word2vec vector feature set yields one such similarity for every pair of vectors it contains, so the average similarity of a vector feature set can be calculated as the mean of these pairwise similarities.

This gives the average similarity of the first expansion word vector of short text T_i. The average similarities of the other expansion word vectors of short text T_i can be calculated in the same way, yielding the average similarity vector sim(V^i) of the expansion word vectors of short text T_i.

By the definition of cosine similarity, the similarity ranges from -1 to 1: -1 means that the two vectors point in exactly opposite directions, 1 means that they point in exactly the same direction, 0 usually means that they are independent, and values in between indicate intermediate similarity or dissimilarity.
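The averaging in S33 can be sketched as the mean over all unordered pairs of vectors in one set. Because the method is not tied to a single similarity measure, the sketch takes the similarity function as a parameter; the example passes a plain dot product, which equals cosine similarity only because the toy vectors have unit length. Returning 0.0 for sets with fewer than two vectors is an assumption of this sketch.

```python
from itertools import combinations
from typing import Callable, Sequence

Vector = Sequence[float]


def average_similarity(
    vectors: Sequence[Vector],
    similarity: Callable[[Vector, Vector], float],
) -> float:
    """Mean similarity over every unordered pair of vectors in one set."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # sketch convention: no pairs, no evidence of relatedness
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


unit_vectors = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
avg = average_similarity(unit_vectors, dot)  # pairs score 1, 0, 0
```

Applying `average_similarity` to the vector set of each candidate in turn gives one score per candidate, which corresponds to the average similarity vector described above.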

S34, screening the group of expansion word vectors with the maximum average semantic similarity as the specific word vector according to the average similarity of the word vector expression set corresponding to each expansion word vector.

The invention performs a screening operation on sim(V^i) and selects the term with the largest value: the larger the value, the closer the semantic association within the corresponding expansion word vector, and the more likely it is to be an expansion vector that can replace the original short text. The invention also proposes that the two largest terms of sim(V^i) may be selected as the expansion word vectors.
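The screening in S34 amounts to picking the candidates with the highest average similarity. The sketch below assumes the scores have already been computed, one per candidate; `select_top_expansions` and the sample data are hypothetical names and values. Setting k=1 selects the single best candidate and k=2 selects the two largest terms.

```python
from typing import List, Sequence, Tuple


def select_top_expansions(
    candidates: Sequence[Tuple[str, ...]],
    avg_similarities: Sequence[float],
    k: int = 1,
) -> List[Tuple[str, ...]]:
    """Return the k candidates with the largest average similarity."""
    if len(candidates) != len(avg_similarities):
        raise ValueError("need exactly one score per candidate")
    # Rank candidate indices by score, highest first (ties keep input order).
    ranked = sorted(
        range(len(candidates)),
        key=lambda index: avg_similarities[index],
        reverse=True,
    )
    return [candidates[index] for index in ranked[:k]]


candidates = [("sport", "fact"), ("sport", "compete"), ("ball", "event")]
scores = [0.2, 0.9, 0.5]  # toy average similarities, one per candidate
best = select_top_expansions(candidates, scores, k=1)
best_two = select_top_expansions(candidates, scores, k=2)
```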

Referring to fig. 2, the invention relates to a short text classification preprocessing system based on semantic extension, comprising:

Further as a preferred embodiment, the preliminary processing unit specifically includes:

Further as a preferred embodiment, the semantic extension unit specifically includes:

Further as a preferred embodiment, the screening unit specifically includes:

The invention also comprises a short text classification preprocessing device based on the semantic extension, which specifically comprises:

a memory for storing a program;

By expanding the information of the short text, the method facilitates the application of later text classification algorithms and can effectively improve classification accuracy. Traditional methods design their expansion around the existing vocabulary of the data set, which tends to restrict the choice of the later classification algorithm or makes it difficult to recognize new words that appear in the test set. The invention instead expands each word in a short text with externally associated sememe words to form a replacement text whose length can be flexibly controlled. It therefore places no restriction on the choice of the later classification algorithm, applies to both training set data and test set data, and gives a better recognition effect on newly appearing words in future detection.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A short text classification preprocessing method based on semantic extension is characterized by comprising the following steps:

inputting the word vectors to be classified into a classifier for text classification;

the semantic similarity calculation is performed on the expansion word vectors in the candidate expansion word vector set, and the group of expansion word vectors with the largest average semantic similarity is obtained by screening and is used as the specific word vector, and the method specifically comprises the following steps:

2. The method of claim 1, wherein the short text classification preprocessing method based on the semantic extension comprises: the method comprises the following steps of performing preliminary processing on short texts to be classified to obtain an original word vector, wherein the steps specifically comprise:

3. The method of claim 1, wherein the short text classification preprocessing method based on the semantic extension comprises: the method specifically includes the steps of performing semantic expansion processing on each word in an original word vector to obtain an expanded word vector, and further forming a candidate expanded word vector set, wherein the steps include:

4. A short text classification preprocessing system based on semantic extension, comprising:

the input unit is used for inputting the word vectors to be classified into the classifier to classify the texts;

the screening unit specifically comprises:

5. The system of claim 4, wherein the short text classification preprocessing system based on the semantic extension comprises: the preliminary processing unit specifically comprises:

6. The system of claim 4, wherein the short text classification preprocessing system based on the semantic extension comprises: the semantic extension unit specifically comprises:

7. An apparatus for preprocessing short text classification based on semantic extension, comprising:

a memory for storing a program;

a processor for executing the program, the program causing the processor to execute the method of the short text classification preprocessing based on the semantic extension according to any one of claims 1 to 3.