CN113378950A - Unsupervised classification method for long texts

Info

Publication number: CN113378950A
Application number: CN202110691284.3A
Authority: CN
Inventors: 林正春; 兰林; 陈功文
Original assignee: Shenzhen Chace Network Information Technology Co ltd
Current assignee: Shenzhen Chace Network Information Technology Co ltd
Priority date: 2021-06-22
Filing date: 2021-06-22
Publication date: 2021-09-10

Abstract

The invention relates to an unsupervised classification method for long texts, comprising the following steps: filtering the long text to be classified and extracting three parts from it, namely the title text, the body text and the issuing-department text; extracting weight coefficients for the title text, the body text and the issuing-department text; fusing the three texts into a new long text T according to the extracted weight coefficients; performing Chinese word segmentation on the new long text T and extracting word segmentation information; inputting the word segmentation information into a word vector model to obtain word vector information; calculating a feature vector of the long text T from the word vector information; and clustering the feature vectors of the long text T to obtain a text classification. The invention improves the long text classification method, reduces the time complexity of long text classification, improves its accuracy, and makes long texts easier for users to read and classify.

Description

Unsupervised classification method for long texts

Technical Field

The invention relates to the technical field of network information, in particular to an unsupervised long text classification method.

Background

The state and government have issued various texts to support the development of enterprises. Guided by these texts, enterprises can understand government guidance more directly and accurately and gain a better understanding of the market, and can therefore produce products that better meet market demand. Government preferential-support texts cover many aspects, such as department recruitment, tax deduction, financing support, environment optimization, and talent recruitment and introduction, and they directly or indirectly promote the healthy development of enterprises. These state and government texts have become an important basis on which enterprises plan their development.

Accurately reading and classifying texts, particularly long texts, has become an important and urgent problem in enterprise development. Because long texts are lengthy and differ from ordinary texts in structure and word usage, classifying them has high time complexity and low accuracy. There is currently relatively little research on long-text classification.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing an unsupervised long text classification method that reduces the time complexity of long text classification, improves its accuracy, and makes long texts easier to read and classify.

The purpose of the invention is realized by the following technical scheme:

An unsupervised classification method for long texts comprises the following steps:

(1) filtering the long text to be classified, and extracting three parts from it: the title text t₁, the body text t₂ and the issuing-department text t₃;

(2) extracting the weight coefficients c₁, c₂, c₃ of the title text t₁, the body text t₂ and the issuing-department text t₃;

(3) fusing the title text t₁, the body text t₂ and the issuing-department text t₃ into a new long text T according to the extracted weight coefficients;

(4) performing Chinese word segmentation on the new long text T, and extracting word segmentation information; inputting the word segmentation information into a word vector model to obtain word vector information;

(5) calculating a feature vector of the long text T according to the word vector information;

(6) clustering the feature vectors of the long text T to obtain a text classification.

Further, the step (2) is specifically as follows:

2.1, preprocessing the title text t₁, the body text t₂ and the issuing-department text t₃, and selecting the feature words of each text with the χ² statistical method;

2.2, according to the feature words of each text, extracting the weight coefficient of each text using TF-IDF and a feature evaluation function based on the χ² statistical method.

Further, the step (3) is specifically as follows:

3.1, calculating the length Lᵢ of each text tᵢ, and taking L = max(L₁, L₂, L₃) and c = max(c₁, c₂, c₃), where i = 1, 2, 3;

3.2, copying text tᵢ L/Lᵢ × cᵢ/c times and concatenating the copies to obtain text Tᵢ;

3.3, concatenating texts T₁, T₂, T₃ in order to obtain the new long text T.

Further, in the step (4), the open source Chinese word segmentation tool HanLP is used to perform Chinese word segmentation on the new long text T, remove stop words, and extract the word segmentation information.

Further, in the step (4), the word vector model is a CBOW model constructed with the word2vec module of the gensim package.

Further, the step (5) is specifically as follows:

5.1, sorting the vectors in the word vector information in ascending order and taking the first N vectors, where N is less than the total number of vectors in the word vector information;

5.2, calculating the center of the first N vectors to obtain the feature vector of the long text T.

The invention has the following beneficial effects. The weight coefficients of the title text, the body text and the issuing-department text are extracted, and the three texts are fused into a new long text according to them. Because the weight coefficient of each text is extracted from that text's feature words, the fused long text contains the feature word pairs and the association information between feature words, which improves the accuracy of text classification. Chinese word segmentation of the new long text extracts the word segmentation information and filters out irrelevant information; inputting the word segmentation information into a word vector model yields word vector information that is used to measure the semantic similarity of sentences and to calculate the similarity between texts, which makes it easier to group similar texts. The feature vector of the long text T is calculated from the word vector information, and the feature vectors are clustered to obtain the text classification. The weight coefficients and feature vectors together improve the long text classification method and its precision and effect.

Drawings

Fig. 1 is a schematic structural diagram of an unsupervised long text classification method according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

(1) A long text is lengthy, and its structure mainly comprises three parts: the title, the body and the issuing department. Before classification, the long text to be classified is therefore filtered, and the title text t₁, the body text t₂ and the issuing-department text t₃ are extracted from it.

(2) The weight coefficients c₁, c₂, c₃ of the title text t₁, the body text t₂ and the issuing-department text t₃ are then extracted. Each weight coefficient indicates how important the corresponding text is within the text under test.

In the preferred technical scheme, the title text t₁, the body text t₂ and the issuing-department text t₃ are preprocessed, and the feature words of each text are selected with the χ² statistical method; then, according to the feature words of each text, the weight coefficient of each text is extracted using TF-IDF and a feature evaluation function based on the χ² statistical method. Extracting the weight coefficients through feature words captures the association information between feature words, which improves the accuracy of text classification.
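The source does not give the formulas behind the χ² statistic or the TF-IDF weighting. As a minimal sketch, assuming the standard 2×2 contingency-table form of χ² and a smoothed TF-IDF variant, the two scores could be computed as shown here. The function names, and the idea of summing TF-IDF over the selected feature words to obtain a part weight, are illustrative assumptions rather than the evaluation function of the invention.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a term/category 2x2 contingency table.

    n11: units in the category that contain the term
    n10: units outside the category that contain the term
    n01: units in the category that lack the term
    n00: units outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # degenerate table: the term carries no information
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def tf_idf(term, doc_tokens, corpus):
    """Smoothed TF-IDF of `term` in `doc_tokens`; `corpus` is a list of token lists."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf


def part_weight(feature_words, doc_tokens, corpus):
    """Hypothetical part weight: sum of TF-IDF over the selected feature words."""
    return sum(tf_idf(w, doc_tokens, corpus) for w in feature_words)
```

A term that occurs only inside one category receives a large χ² score and would be kept as a feature word, while a term spread evenly across categories scores zero.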

(3) According to the weight coefficients, the title text t₁, the body text t₂ and the issuing-department text t₃ are fused into a new long text T, which contains the feature word pairs and the association information between feature words, so that the accuracy of text classification can be effectively improved.

In the preferred technical scheme, the length Lᵢ of each text tᵢ is calculated, taking L = max(L₁, L₂, L₃) and c = max(c₁, c₂, c₃), where i = 1, 2, 3; text tᵢ is copied L/Lᵢ × cᵢ/c times and the copies are concatenated to obtain text Tᵢ; texts T₁, T₂, T₃ are then concatenated in order to obtain the new long text T. Adjusting the lengths of the extracted texts in this way fuses the title text t₁, the body text t₂ and the issuing-department text t₃ into a new long text T that is easier to interpret and classify.
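The fusion rule can be sketched in a few lines of Python. The source does not say how a non-integer copy count is handled, so the rounding to the nearest integer with a minimum of one copy is an assumption, as is the function name `fuse_parts`.

```python
def fuse_parts(parts, weights):
    """Fuse texts t_i into one long text T.

    Each t_i is repeated about (L / L_i) * (c_i / c) times, where L is the
    longest part length and c is the largest weight coefficient.
    Assumption: the count is rounded and never drops below one copy.
    """
    lengths = [len(p) for p in parts]
    longest = max(lengths)      # L = max(L_1, L_2, L_3)
    top_weight = max(weights)   # c = max(c_1, c_2, c_3)
    fused = []
    for text, length, weight in zip(parts, lengths, weights):
        if length == 0:
            continue            # an empty part contributes nothing
        copies = max(1, round(longest / length * weight / top_weight))
        fused.append(text * copies)   # T_i: the copies joined together
    return "".join(fused)             # T = T_1 + T_2 + T_3
```

With this rule a short but heavily weighted title is repeated until it occupies roughly as much of T as the body, so it is not drowned out in the later word vector statistics.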

(4) Chinese word segmentation is performed on the new long text T, word segmentation information is extracted, and irrelevant information is filtered out; the word segmentation information is input into a word vector model to obtain word vector information, which is used to measure the semantic similarity of sentences and to calculate the similarity between texts, making it easier to group similar texts.

In the preferred technical scheme, the open source Chinese word segmentation tool HanLP is used to perform Chinese word segmentation on the new long text T, remove stop words, and extract the word segmentation information. HanLP is a toolkit consisting of a series of models and algorithms. It provides lexical analysis, syntactic analysis, text analysis, sentiment analysis and other functions, and is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora and customizability. Its bottom layer uses a series of high-speed data structures, giving a word segmentation rate of 20 million characters per second while requiring only 120 MB of memory. In terms of IO, dictionary loading is very fast, and startup takes only 500 ms. Its efficiency and applicability make it well suited to Chinese word segmentation of long texts.
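Stop-word removal after segmentation might look like the sketch below. It assumes the tokens have already been produced by a segmenter such as HanLP and does not call HanLP itself; the function name `remove_stop_words` and the sample stop-word set are illustrative only.

```python
def remove_stop_words(tokens, stop_words):
    """Keep segmented tokens that are neither stop words nor blank.

    `tokens` is assumed to be the word list returned by a Chinese
    segmenter; `stop_words` is any set-like collection of words to drop.
    """
    cleaned = []
    for token in tokens:
        word = token.strip()
        if word and word not in stop_words:
            cleaned.append(word)
    return cleaned


# Example input: a pre-segmented sentence and a tiny stop-word set.
sample_tokens = ["企业", "的", "税收", "减免", "和", "融资", "支持"]
sample_stop_words = {"的", "和"}
filtered = remove_stop_words(sample_tokens, sample_stop_words)
```

The filtered token list is what would be passed on to the word vector model.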

In the preferred technical scheme, the word vector model is a CBOW model constructed with the word2vec module of the gensim package; the model is simple to construct and makes word vectorization convenient. In the CBOW model, the center word is predicted from its surrounding words, and the vectors of the surrounding words are adjusted continually by gradient descent according to how well the center word is predicted. After the word segmentation information is input into the CBOW model, each word is taken in turn as the center word and the vectors of its surrounding words are adjusted together, with the obtained gradient applied to the word vectors of the surrounding words; this yields the word vectors of all words in the whole text. The number of predictions made by the CBOW model is approximately equal to the number of words in the whole text, so the model is efficient and fast.
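To make the CBOW update concrete, the following is a from-scratch sketch of one gradient-descent step using a full softmax over the vocabulary. It is not the gensim implementation named in the source, which relies on faster approximations; the function name `cbow_step`, the learning rate and the list-of-lists weight layout are assumptions chosen for readability.

```python
import math


def cbow_step(w_in, w_out, context_ids, center_id, lr=0.1):
    """One CBOW gradient-descent step; updates weights in place, returns the loss.

    w_in[i]  : input vector of word i (the word vectors being learned)
    w_out[j] : output vector of word j
    """
    dim = len(w_in[0])
    n_ctx = len(context_ids)
    # Hidden layer: average of the surrounding (context) word vectors.
    h = [sum(w_in[i][k] for i in context_ids) / n_ctx for k in range(dim)]
    # Score every vocabulary word and normalise with a stable softmax.
    scores = [sum(row[k] * h[k] for k in range(dim)) for row in w_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[center_id])
    # Error signal: predicted distribution minus the one-hot center word.
    err = [p - (1.0 if j == center_id else 0.0) for j, p in enumerate(probs)]
    # Gradient w.r.t. the hidden layer, computed before w_out is changed.
    grad_h = [sum(err[j] * w_out[j][k] for j in range(len(w_out)))
              for k in range(dim)]
    # Update output vectors.
    for j, row in enumerate(w_out):
        for k in range(dim):
            row[k] -= lr * err[j] * h[k]
    # Apply the same gradient (scaled by the average) to every context vector.
    for i in context_ids:
        for k in range(dim):
            w_in[i][k] -= lr * grad_h[k] / n_ctx
    return loss
```

Repeating the step on the same context and center word lowers the loss, which is the behaviour the paragraph above describes: the prediction error for the center word is pushed back onto the vectors of its surrounding words.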

(5) The feature vector of the long text T is calculated from the word vector information. In the preferred technical scheme, the vectors in the word vector information are sorted in ascending order and the first N vectors are taken, where N is less than the total number of vectors in the word vector information; the center of the first N vectors is then calculated to obtain the feature vector of the long text T.
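A sketch of this step is given below. The source does not state which quantity the vectors are sorted by, so ordering by Euclidean norm is purely an assumption here, and a different key can be passed in; the function name `text_feature_vector` is likewise hypothetical.

```python
import math


def euclidean_norm(vec):
    """Length of a vector; used as the default (assumed) sort key."""
    return math.sqrt(sum(x * x for x in vec))


def text_feature_vector(vectors, n, key=euclidean_norm):
    """Sort word vectors ascending by `key`, keep the first n, return their center."""
    if not 0 < n < len(vectors):
        raise ValueError("n must be positive and less than the number of vectors")
    first_n = sorted(vectors, key=key)[:n]
    dim = len(first_n[0])
    # Center = component-wise mean of the selected vectors.
    return [sum(v[k] for v in first_n) / n for k in range(dim)]
```

Because only N vectors are averaged, the cost of building the feature vector is dominated by the sort, regardless of how long the fused text T is.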

(6) The feature vectors of the long text T are clustered to obtain the text classification, with higher accuracy and lower time complexity for long text classification.
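The source does not name a clustering algorithm. As one possible illustration, assuming k-means with squared Euclidean distance and a fixed number of clusters k, the feature vectors could be grouped as follows; the function name `kmeans`, the seeded random initialization and the iteration cap are all assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100, seed=0):
    """Plain k-means; returns (labels, centers) for a list of feature vectors."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # initial centers
    labels = [0] * len(points)
    for it in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if it > 0 and new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Each resulting label is the class assigned to the corresponding long text; texts whose feature vectors lie close together end up in the same class without any labelled training data.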

The invention addresses the high time complexity and low classification accuracy that arise because long texts are lengthy and differ from ordinary texts in structure and word usage. By improving the long text classification method, it further improves the method's precision and effect and is better suited to reading and classifying long texts.

It should be understood that the above-described embodiments are merely preferred examples of the present invention and the technical principles applied thereto, and any changes, modifications, substitutions, combinations and simplifications made by those skilled in the art without departing from the spirit and principle of the present invention shall be covered by the protection scope of the present invention.

Claims

1. An unsupervised classification method of long texts is characterized by comprising the following steps:

(3) fusing the title text t₁, the body text t₂ and the issuing-department text t₃ into a new long text T according to the extracted weight coefficients;

2. The unsupervised classification method for long texts according to claim 1, wherein the step (2) is specifically as follows:

2.2, according to the feature words of each text, extracting the weight coefficient of each text using TF-IDF and a feature evaluation function based on the χ² statistical method.

3. The unsupervised classification method for long texts according to claim 1, wherein the step (3) is specifically as follows:

3.1, calculating the length Lᵢ of each text tᵢ, and taking L = max(L₁, L₂, L₃) and c = max(c₁, c₂, c₃), where i = 1, 2, 3;

4. The unsupervised classification method for long texts according to claim 1, wherein in the step (4), the open source Chinese word segmentation tool HanLP is used to perform Chinese word segmentation on the new long text T, remove stop words, and extract the word segmentation information.

5. The unsupervised classification method for long texts according to claim 1, wherein in the step (4), the word vector model is a CBOW model constructed with the word2vec module of the gensim package.

6. The unsupervised classification method for long texts according to claim 1, wherein the step (5) is specifically as follows: