CN111061880A - Method for rapidly clustering massive text data

Info

Publication number: CN111061880A
Application number: CN201911347726.1A
Authority: CN
Inventors: 陈泽勇; 张治同; 李志强; 姚松; 张莉
Original assignee: Chengdu Dippmann Information Technology Co Ltd; Sichuan University
Current assignee: Chengdu Dippmann Information Technology Co Ltd; Sichuan University
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2020-04-24

Abstract

The invention provides a method for rapidly clustering massive text data. Command line parameters input from outside and text information read from a specified directory are preprocessed; a preset data structure is then called through an internal interface to cluster the text data in the specified directory; the clustering result is output as an EXCEL file in the specified directory or in a graphical interface; and the clustering effect is evaluated. The method specifically comprises the following steps: reading text data, preprocessing text information, clustering the text data, and outputting the clustering result. The text information preprocessing comprises: s1: performing word segmentation on Chinese documents and tokenization on English documents; s2: removing stop words; s3: calculating the simhash code of the document with stop words removed; s4: performing word embedding in word2vector mode and calculating the document vector of the document with stop words removed; s5: performing word embedding in BERT vector mode to obtain word vectors. Through internal or external evaluation, the method allows the optimal clustering algorithm strategy to be selected.

Description

Method for rapidly clustering massive text data

Technical Field

The invention relates to the field of text clustering, in particular to a method for quickly clustering massive text data.

Background

Text clustering groups similar documents together, relying on the property that documents of the same class are highly similar while documents of different classes are not. Current clustering methods concentrate on improving clustering accuracy. In terms of efficiency, processing text data on the order of thousands or tens of thousands of documents takes a long time, and the accuracy of general clustering algorithms also fails to meet the clustering demands of such a large amount of text data. Because different clustering algorithms affect accuracy and clustering time differently, the evaluation of a given clustering algorithm can take clustering accuracy and clustering time as observation points, so that the clustering effect can be judged effectively and an efficient clustering algorithm can be selected for each clustering demand. At present, both the clustering of text data on the order of thousands or tens of thousands of documents and the evaluation of its clustering effect are relatively lacking.

Disclosure of Invention

The invention aims to provide a method for quickly clustering massive text data, suitable for execution on computer equipment: command line parameters input through an external interface and text information read from a specified directory are preprocessed; a preset data structure is then called through an internal interface to cluster the text data in the specified directory; the clustering result is output as an EXCEL file in the specified directory or in a graphical interface; and the clustering effect is evaluated.

Further, the command line parameters comprise a clustering algorithm, a word vector coding mode, a text distance measurement mode and an evaluation mode;

the clustering algorithm comprises K-means, single-pass clustering, hierarchical clustering and density clustering;

the word vector coding mode comprises a simhash code, a word2vector and a BERT vector;

the text distance measurement mode comprises Euclidean distance, Hamming distance, cosine distance and K-L divergence;

the evaluation mode comprises internal evaluation and external evaluation.
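The four text distance measurement modes listed above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the use of integers for Hamming distance (as would suit simhash fingerprints), and the smoothing constant `eps` are assumptions, not details given in the patent.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length float vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def cosine_distance(a, b):
    """1 - cosine similarity; zero vectors are treated as maximally distant
    (a convention assumed here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kl_divergence(p, q, eps=1e-12):
    """K-L divergence D(P || Q) of two discrete distributions.
    eps (an assumption) avoids log(0) when q has zero entries."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )
```

Note that K-L divergence is asymmetric and expects probability distributions, so vectors would need to be normalized before it is applied.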

Further, the clustering method comprises the following steps:

reading text data: reading text information from a text file of a specified directory;

preprocessing text information: completing word embedding through a word vector coding mode to obtain a word vector;

clustering text data: clustering and effect evaluation are carried out on the preprocessed word vectors by combining at least one clustering algorithm with an internal or external clustering algorithm evaluation system;

outputting the clustering result: the clustering result is output in EXCEL format or in a graphical interface.

Further, with simhash coding, the similarity of the texts to be clustered is judged using the selected text distance measurement mode; with word2vector, the word vectors of a sentence are summed and averaged to obtain a sentence vector, and the sentence vectors are summed and averaged to obtain a document vector; the BERT vector mode works on the same basis as word2vector and is served as a WEB service over HTTP through a WEB service layer at the DOCKER container level, so that requests from multiple tasks can be served efficiently.

Further, in the word vector encoding method, all word vectors adopt floating point vectors of 200 dimensions.

Further, the text information preprocessing step includes the following sub-steps:

s1: performing word segmentation on Chinese documents and tokenization on English documents;

s2: removing stop words;

s3: calculating a simhash code of the document without the stop word;

s4: word embedding is carried out in a word2vector mode, and a document vector with stop words removed is calculated;

s5: performing word embedding in BERT vector mode to obtain word vectors.
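Sub-steps s1 and s2 can be illustrated for an English document with the short sketch below. The stop-word list and the function names are hypothetical, and the regular-expression tokenizer is only one possible choice. Chinese word segmentation would need a dedicated segmenter and is not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def tokenize_english(text):
    """s1 (English case): lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """s2: drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


tokens = remove_stop_words(tokenize_english("The clustering of text data is fast."))
print(tokens)  # ['clustering', 'text', 'data', 'fast']
```

The resulting token list is the input that steps s3 to s5 would encode.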

Further, the word vector is used for word embedding representation in a text information preprocessing stage.

Further, the preset data structure comprises a map structure and a list structure; the list structure stores the intermediate clustering results, and the map structure stores the final clustering result.

A system for rapidly clustering massive text data comprises a data extraction module, a clustering operation module and a clustering result output module; the clustering operation module comprises a preprocessing module, a clustering algorithm execution module and a clustering effect evaluation module.

Further, the data extraction module reads text information from a text file of the specified directory and sends the text information to the preprocessing module;

the preprocessing module receives command line parameters input by an external interface and text information sent by the data extraction module and then executes a text information preprocessing step, and sends the command line parameters and preprocessed word vectors to the clustering algorithm execution module;

the clustering algorithm execution module calls a preset structure body through an internal interface to finish clustering the text data under the specified directory; the clustering effect evaluation module evaluates the clustering effect through an internal or external clustering algorithm evaluation system;

the clustering result output module outputs EXCEL files or graphical interface clustering results under a specified directory; when an internal evaluation system is used, an internal evaluation index value for the clustering effect is output.
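The patent does not name the internal evaluation index it outputs. As one commonly used internal index, the mean silhouette coefficient can be computed from the vectors and cluster labels alone, as in the sketch below. The choice of this index, the Euclidean distance, and the function name are all assumptions made for illustration.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance.
    Assumed stand-in for the unspecified internal evaluation index."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score defined as 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters; values below 0 suggest that many documents sit closer to another cluster than to their own.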

The beneficial effects of the invention are as follows: the method performs clustering on a given data set of text files using no fewer than 4 conventional clustering algorithms, no fewer than 3 text distance measurement modes and 2 different clustering result evaluation modes, and achieves a clustering time of at most 10 minutes for text file data on the order of ten thousand documents and at most 1 minute for text file data on the order of a thousand documents; meanwhile, the accuracy and efficiency of each clustering algorithm are judged through an internal or external clustering algorithm evaluation system, so that the optimal clustering algorithm strategy can be selected.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a control chart for implementing the clustering algorithm of the present invention.

Detailed Description

In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.

A method for quickly clustering massive text data is suitable for execution on computer equipment: command line parameters input through an external interface and text information read from a specified directory are preprocessed; a preset data structure is then called through an internal interface to cluster the text data in the specified directory; the clustering result is output as an EXCEL file in the specified directory or in a graphical interface; and the clustering effect is evaluated.

The command line parameters comprise a clustering algorithm, a word vector coding mode, a text distance measurement mode and an evaluation mode; the clustering algorithm comprises K-means, single-pass clustering, hierarchical clustering and density clustering; the word vector coding mode comprises a simhash code, a word2vector and a BERT vector; the text distance measurement mode comprises Euclidean distance, Hamming distance, cosine distance and K-L divergence; the evaluation mode comprises internal evaluation and external evaluation.

When a user selects document clustering, the user can input the chosen command line parameters through the external interface. The defaults are: single-pass clustering for the clustering algorithm, simhash coding for the word vector coding mode, Euclidean distance for the text distance measurement mode, and internal evaluation for the evaluation mode.
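The four parameters and their defaults could be exposed with Python's standard `argparse` module, as in the sketch below. The option names (`--algorithm`, `--encoding`, `--distance`, `--evaluation`), the choice strings and the positional `directory` argument are hypothetical; the patent specifies only the parameter categories and their default values.

```python
import argparse


def build_parser():
    """Command line parser; all flag names and choice strings are hypothetical."""
    parser = argparse.ArgumentParser(description="Cluster the text files in a directory")
    parser.add_argument("directory", help="directory holding the text files to cluster")
    parser.add_argument(
        "--algorithm",
        choices=["kmeans", "single-pass", "hierarchical", "density"],
        default="single-pass",  # default stated in the description
    )
    parser.add_argument(
        "--encoding",
        choices=["simhash", "word2vector", "bert"],
        default="simhash",
    )
    parser.add_argument(
        "--distance",
        choices=["euclidean", "hamming", "cosine", "kl"],
        default="euclidean",
    )
    parser.add_argument(
        "--evaluation",
        choices=["internal", "external"],
        default="internal",
    )
    return parser


# Only the directory is given, so every other parameter takes its default.
args = build_parser().parse_args(["./docs"])
print(args.algorithm, args.encoding, args.distance, args.evaluation)
```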

With simhash coding, the similarity of the texts to be clustered is judged using the selected text distance measurement mode; with word2vector, the word vectors of a sentence are summed and averaged to obtain a sentence vector, and the sentence vectors are summed and averaged to obtain a document vector; the BERT vector mode works on the same basis as word2vector and is served as a WEB service over HTTP through a WEB service layer at the DOCKER container level, so that requests from multiple tasks can be served efficiently.
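A minimal simhash sketch is given below to show how a document's tokens can be reduced to a fingerprint that is then compared by Hamming distance. The 64-bit width, the use of MD5 as the per-token hash and the unit weight per token occurrence are assumptions; the patent does not specify these details.

```python
import hashlib


def simhash(tokens, bits=64):
    """Simhash fingerprint of a token list.
    Assumptions: MD5 as token hash, weight 1 per token occurrence."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a cleared bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = simhash(["fast", "text", "clustering", "method"])
doc_b = simhash(["fast", "text", "clustering", "system"])
print(hamming(doc_a, doc_b))  # smaller values indicate more similar token sets
```

Because every token contributes a vote per bit, documents that share most of their tokens tend to produce fingerprints that differ in few bits.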

The text clustering method shown in fig. 1 includes the following steps:

reading text data: the text data files to be clustered are read from the specified directory;

preprocessing text information: word embedding is completed in the selected word vector coding mode to obtain word vectors; the main preprocessing comprises the following steps:

word segmentation (an English document is not segmented but must be tokenized);

removing stop words;

in simhash coding mode, directly calculating the simhash code of the document with stop words removed;

in word2vector mode, performing word embedding and then directly calculating the vector of the document with stop words removed; specifically, the word vectors of a sentence are summed and averaged to obtain the sentence vector, and the sentence vectors are summed and averaged to obtain the document vector;

in BERT vector mode, performing word embedding on the same basic principle as in word2vector mode; however, because the BERT model is large, calling it directly in the way word2vector is called cannot serve several tasks efficiently at the same time, so a container-level WEB service dedicated to the BERT model is built with DOCKER container technology, and a RESTful WEB service is provided over HTTP through the WEB service layer at the DOCKER container level;

clustering text data: clustering and effect evaluation are carried out on the preprocessed word vectors by combining at least one clustering algorithm with an internal or external clustering algorithm evaluation system; at least 4 clustering algorithm implementations are provided, based respectively on K-means, hierarchical, density and single-pass algorithms, together with 2 broad classes of clustering algorithm evaluation systems, namely an external evaluation system and an internal evaluation system;

outputting the clustering result: the clustering result can be output both in EXCEL format and in a graphical interface; in addition, when an internal evaluation system is used, an internal evaluation index value for the clustering effect can be output.
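Single-pass clustering, the default algorithm, can be sketched as follows: each document vector is compared with the existing cluster centroids and either joins the most similar cluster or starts a new one. The cosine similarity measure, the threshold value of 0.8 and the running-mean centroid update are assumptions chosen for illustration; the patent does not describe the algorithm's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(vectors, threshold=0.8):
    """Assign each vector to the most similar cluster, or open a new cluster.
    threshold is an assumed tuning parameter."""
    centroids, sizes, labels = [], [], []
    for v in vectors:
        best, best_sim = -1, -1.0
        for k, c in enumerate(centroids):
            sim = cosine_similarity(v, c)
            if sim > best_sim:
                best, best_sim = k, sim
        if best >= 0 and best_sim >= threshold:
            # Update the centroid as the running mean of its members.
            n = sizes[best]
            centroids[best] = [(ci * n + vi) / (n + 1) for ci, vi in zip(centroids[best], v)]
            sizes[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(v))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels


labels = single_pass([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(labels)  # [0, 0, 1, 1]
```

Each document is visited once, which is what makes this family of algorithms attractive for large collections; the result does depend on the order in which documents arrive.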

The word vectors are used for word embedding representation in the text information preprocessing stage, and all the word vectors adopt floating point vectors of 200 dimensions.
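The two-level averaging used in word2vector mode (word vectors to sentence vector, sentence vectors to document vector) can be sketched as below. The embedding lookup is represented by a plain dictionary, which is an assumption; in practice it would come from a trained model. Skipping out-of-vocabulary tokens and empty sentences is also an assumed convention.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of a list of vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def document_vector(sentences, embeddings, dim=200):
    """sentences: list of token lists; embeddings: dict mapping token -> vector.
    dim defaults to the 200 dimensions stated in the description."""
    sentence_vectors = []
    for tokens in sentences:
        # Tokens missing from the embedding table are skipped (assumed convention).
        word_vectors = [embeddings[t] for t in tokens if t in embeddings]
        if word_vectors:
            sentence_vectors.append(mean_vector(word_vectors, dim))
    return mean_vector(sentence_vectors, dim)


# Toy 2-dimensional embeddings, for illustration only.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(document_vector([["a", "b"], ["c"]], toy, dim=2))  # [0.75, 0.75]
```

In the toy example the first sentence averages to `[0.5, 0.5]`, the second to `[1.0, 1.0]`, and the document vector is the mean of those two sentence vectors.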

The preset data structure comprises a map structure for storing the final clustering result and a list structure for storing the intermediate results of clustering.
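One plausible reading of these two structures is a list of per-document assignments that is folded into a map from cluster label to member documents once clustering finishes. The sketch below follows that reading; the exact layout of both structures, the file names and the function name are assumptions, since the patent does not define them.

```python
from collections import defaultdict

# Intermediate result (list structure, assumed layout):
# one (document id, cluster label) pair per processed document.
intermediate = [("doc1.txt", 0), ("doc2.txt", 1), ("doc3.txt", 0)]


def build_result_map(assignments):
    """Fold the intermediate list into the final map: cluster label -> document ids."""
    result = defaultdict(list)
    for doc_id, label in assignments:
        result[label].append(doc_id)
    return dict(result)


clusters = build_result_map(intermediate)
print(clusters)  # {0: ['doc1.txt', 'doc3.txt'], 1: ['doc2.txt']}
```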

A text clustering system for realizing the text clustering method comprises a data extraction module, a clustering operation module and a clustering result output module; the clustering operation module comprises a preprocessing module, a clustering algorithm execution module and a clustering effect evaluation module.

The data extraction module reads text information from the text file of the specified directory and sends the text information to the preprocessing module;

As shown in fig. 2, when the clustering algorithm execution module executes normally, the clustering effect evaluation module evaluates the clustering effect through an internal or external clustering algorithm evaluation system; when the text data set is too large and memory overflows, all internal data and state of the system are rolled back to the state before the error, and the error condition of the clustering algorithm is recorded.
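The rollback behaviour on memory overflow can be illustrated with a snapshot-and-restore wrapper, as sketched below. Holding the system state in a dictionary, copying it with `copy.deepcopy` and catching `MemoryError` are all assumptions about how such a rollback might be realized; the patent describes only the behaviour, not the mechanism.

```python
import copy
import logging

logger = logging.getLogger("clustering")


def run_with_rollback(state, step):
    """Run step(state); on MemoryError restore state to its contents before the call.
    state is assumed to be a dict holding all internal data of the system."""
    snapshot = copy.deepcopy(state)
    try:
        step(state)
        return True
    except MemoryError as exc:
        state.clear()
        state.update(snapshot)
        logger.error("clustering failed, state rolled back: %r", exc)
        return False


state = {"clusters": {0: ["a.txt"]}, "processed": 1}


def failing_step(s):
    """Simulated clustering step that changes state and then runs out of memory."""
    s["processed"] = 2
    raise MemoryError("data set too large")


ok = run_with_rollback(state, failing_step)
print(ok, state)  # False {'clusters': {0: ['a.txt']}, 'processed': 1}
```

A full deep copy doubles the memory held for the state, so a real system under memory pressure would more likely checkpoint to disk or snapshot only the parts a step can modify.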

The text clustering system further comprises a log management module, which records error information; the error information comprises the error time, error level, error cause and error location, and the error location is displayed using recursive calls.
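The four error fields can be captured with Python's standard `logging` module, as in the sketch below: the formatter supplies time, level and location, the message carries the cause, and `Logger.exception` appends the traceback, which lists the chain of calls leading to the error. The logger name, the format string and the `recurse` helper are hypothetical, and reading "displayed using recursive calls" as a call-chain traceback is an interpretation.

```python
import io
import logging


def make_logger(stream):
    """Logger whose records carry time, level, location and message.
    The logger name and format string are hypothetical."""
    log = logging.getLogger("clustering.errors")
    log.setLevel(logging.INFO)
    log.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(funcName)s:%(lineno)d %(message)s")
    )
    log.addHandler(handler)
    log.propagate = False
    return log


def recurse(n):
    """Hypothetical recursive routine that fails at the deepest level."""
    if n == 0:
        raise ValueError("empty cluster")
    recurse(n - 1)


stream = io.StringIO()
logger = make_logger(stream)
try:
    recurse(2)
except ValueError:
    # exception() logs at ERROR level and appends the traceback (the call chain).
    logger.exception("clustering step failed")

output = stream.getvalue()
print(output)
```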

The foregoing shows and describes the general principles, main features and advantages of the present invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and the specification merely illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims.

Claims

1. A method for rapidly clustering massive text data, characterized in that command line parameters input through an external interface and text information read from a specified directory are preprocessed; a preset data structure is then called through an internal interface to cluster the text data in the specified directory; the clustering result is output as an EXCEL file in the specified directory or in a graphical interface; and the clustering effect is evaluated.

2. The method for rapidly clustering massive text data according to claim 1, wherein the command line parameters comprise a clustering algorithm, a word vector encoding mode, a text distance measurement mode and an evaluation mode;

the evaluation mode comprises internal evaluation and external evaluation.

3. The method for rapidly clustering massive text data according to claim 2, wherein the clustering method comprises the following steps:

4. The method for rapidly clustering massive text data according to claim 2, wherein, with simhash coding, the similarity of the texts to be clustered is judged using the selected text distance measurement mode; with word2vector, the word vectors of a sentence are summed and averaged to obtain a sentence vector, and the sentence vectors are summed and averaged to obtain a document vector; the BERT vector mode works on the same basis as word2vector and is served as a WEB service over HTTP through a WEB service layer at the DOCKER container level, so that requests from multiple tasks can be served efficiently.

5. The method for rapidly clustering massive text data according to claim 2, wherein in the word vector encoding mode, all word vectors adopt 200-dimensional floating point vectors.

6. The method for rapidly clustering massive text data according to claim 3 or 4, wherein the text information preprocessing step comprises the following substeps:

s2: removing stop words;

s3: calculating a simhash code of the document without the stop word;

s5: performing word embedding in BERT vector mode to obtain word vectors.

7. The method for rapidly clustering massive text data according to claim 3 or 5, wherein the word vector is used for word embedding representation in a text information preprocessing stage.

8. The method for rapidly clustering massive text data according to claim 1, wherein the structure comprises a map structure and a list structure, the list structure stores intermediate results of clustering, and the map structure stores clustering results.

9. A system for rapidly clustering massive text data, characterized by comprising a data extraction module, a clustering operation module and a clustering result output module; the clustering operation module comprises a preprocessing module, a clustering algorithm execution module and a clustering effect evaluation module.

10. The system for rapidly clustering massive text data according to claim 9, wherein the data extraction module reads text information from a text file of a designated directory and sends the text information to the preprocessing module;