CN111061880A - Method for rapidly clustering massive text data - Google Patents

Info

Publication number
CN111061880A
Authority
CN
China
Prior art keywords
clustering
word
text
text data
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911347726.1A
Other languages
Chinese (zh)
Inventor
陈泽勇
张治同
李志强
姚松
张莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Dippmann Information Technology Co Ltd
Sichuan University
Original Assignee
Chengdu Dippmann Information Technology Co Ltd
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-24
Filing date
2019-12-24
Publication date
2020-04-24
Application filed by Chengdu Dippmann Information Technology Co Ltd, Sichuan University
Priority to CN201911347726.1A
Publication of CN111061880A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for rapidly clustering massive text data. Command line parameters supplied from outside and text information read from a specified directory are preprocessed; a preset structure is then called through an internal interface to cluster the text data in the specified directory; the clustering result is output as an EXCEL file in the specified directory or in a graphical interface; and the clustering effect is evaluated. The method specifically comprises the following steps: reading text data, preprocessing text information, clustering the text data, and outputting a clustering result. The text information preprocessing comprises: S1: performing word segmentation on Chinese documents and tokenization on English documents; S2: removing stop words; S3: calculating a simhash code of the document with stop words removed; S4: performing word embedding in the word2vector mode and calculating a document vector of the document with stop words removed; S5: performing word embedding in the bert vector mode to obtain word vectors. Through internal or external evaluation, the method allows the best-performing clustering algorithm strategy to be selected.

Description

Method for rapidly clustering massive text data
Technical Field
The invention relates to the field of text clustering, in particular to a method for quickly clustering massive text data.
Background
Text clustering aggregates similar documents, relying on the property that documents of the same kind are highly similar to one another while documents of different kinds are not. Current clustering methods concentrate on improving clustering accuracy. With regard to clustering efficiency, processing text data on the order of thousands or tens of thousands of documents takes a long time, and the accuracy of general clustering algorithms also cannot meet the clustering demands of such a large amount of text data. Because different clustering algorithms affect clustering accuracy and clustering time differently, both clustering accuracy and clustering time can be taken as observation points when evaluating a given clustering algorithm, so that the clustering effect can be judged effectively and an efficient clustering algorithm can be selected for different clustering demands. At present, clustering processing and clustering-effect evaluation for text data on the order of thousands or tens of thousands of documents are relatively lacking.
Disclosure of Invention
The invention aims to provide a method for rapidly clustering massive text data that is suitable for execution on computer equipment. Command line parameters input through an external interface and text information read from a specified directory are preprocessed; a preset structure is then called through an internal interface to cluster the text data in the specified directory; the clustering result is output as an EXCEL file in the specified directory or in a graphical interface; and the clustering effect is evaluated.
Further, the command line parameters comprise a clustering algorithm, a word vector encoding mode, a text distance measurement mode and an evaluation mode;
the clustering algorithm comprises K-means, single-pass clustering, hierarchical clustering and density clustering;
the word vector encoding mode comprises simhash encoding, word2vector and bert vector;
the text distance measurement mode comprises Euclidean distance, Hamming distance, cosine distance and K-L divergence;
the evaluation mode comprises internal evaluation and external evaluation.
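The four text distance measurement modes named above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are assumptions, the Hamming distance is assumed to operate on integer fingerprints such as simhash codes, and the K-L divergence is assumed to receive discrete probability distributions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length float vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(code_a, code_b):
    """Number of differing bits between two integer fingerprints."""
    return bin(code_a ^ code_b).count("1")

def cosine_distance(a, b):
    """1 minus cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def kl_divergence(p, q, eps=1e-12):
    """K-L divergence D(p || q) of two discrete distributions.

    eps guards against log(0). The divergence is not symmetric, so
    kl_divergence(p, q) generally differs from kl_divergence(q, p).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

Because the K-L divergence is asymmetric, a clustering routine that needs a symmetric measure would have to combine both directions; the text does not say how this case is handled.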
Further, the clustering method comprises the following steps:
reading text data: reading text information from the text files in a specified directory;
preprocessing text information: completing word embedding through the word vector encoding mode to obtain word vectors;
clustering text data: clustering the preprocessed word vectors and evaluating the effect by combining at least one clustering algorithm with an internal or external clustering algorithm evaluation system;
outputting a clustering result: outputting the clustering result in EXCEL format or in a graphical interface.
Further, the simhash encoding judges the similarity of the clustered texts based on the text distance measurement mode; in the word2vector mode, word vectors are summed and averaged to obtain a sentence vector, and sentence vectors are summed and averaged to obtain a document vector; the bert vector mode works on the same basis as word2vector but provides a WEB service over HTTP through a WEB service layer at the DOCKER container level, so that requests from multiple tasks can be served efficiently.
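The two-level averaging described for the word2vector mode can be sketched as below. The embedding lookup is assumed to be a plain dictionary from token to vector, and skipping tokens that have no embedding is an assumption of this sketch; the names `mean_vector` and `document_vector` are hypothetical.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def document_vector(sentences, embeddings, dim):
    """Average word vectors per sentence, then average the sentence vectors.

    sentences:  list of token lists, one list per sentence
    embeddings: dict mapping token -> vector of length dim
    """
    sentence_vectors = []
    for tokens in sentences:
        word_vectors = [embeddings[t] for t in tokens if t in embeddings]
        if word_vectors:  # sentences with no known tokens are skipped
            sentence_vectors.append(mean_vector(word_vectors, dim))
    return mean_vector(sentence_vectors, dim)
```

With the 200-dimensional vectors mentioned in the text, `dim` would be 200; the toy usage `document_vector([["a", "b"], ["c"]], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, 2)` averages `[0.5, 0.5]` and `[1.0, 1.0]`.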
Further, in the word vector encoding mode, all word vectors are 200-dimensional floating-point vectors.
Further, the text information preprocessing step includes the following sub-steps:
S1: performing word segmentation on Chinese documents and tokenization on English documents;
S2: removing stop words;
S3: calculating a simhash code of the document with stop words removed;
S4: performing word embedding in the word2vector mode and calculating a document vector of the document with stop words removed;
S5: performing word embedding in the bert vector mode to obtain word vectors.
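Sub-steps S1 to S3 can be illustrated for the English case with the sketch below. The stop-word set, the regular-expression tokenizer, the 64-bit fingerprint width and the use of MD5 as the per-token hash are all assumptions chosen for illustration; the text does not specify them, and Chinese word segmentation would need a dedicated segmenter that is not shown here.

```python
import hashlib
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to"}  # illustrative only

def tokenize_english(text):
    """S1 (English case): lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """S2: drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

def simhash(tokens, bits=64):
    """S3: classic simhash, where each token's hash votes +1 or -1 per bit position."""
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)
```

Two documents can then be compared by the Hamming distance between their fingerprints, which stays small when the documents share most of their tokens.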
Further, the word vectors are used for word embedding representation in the text information preprocessing stage.
Further, the structure comprises a map structure and a list structure; the list structure stores the intermediate clustering results, and the map structure stores the clustering result.
A system for rapidly clustering massive text data comprises a data extraction module, a clustering operation module and a clustering result output module; the clustering operation module comprises a preprocessing module, a clustering algorithm execution module and a clustering effect evaluation module.
Further, the data extraction module reads text information from the text files in the specified directory and sends it to the preprocessing module;
the preprocessing module receives the command line parameters input through the external interface and the text information sent by the data extraction module, executes the text information preprocessing step, and sends the command line parameters and the preprocessed word vectors to the clustering algorithm execution module;
the clustering algorithm execution module calls the preset structure through an internal interface to cluster the text data in the specified directory; the clustering effect evaluation module evaluates the clustering effect through an internal or external clustering algorithm evaluation system;
the clustering result output module outputs the clustering result as an EXCEL file in the specified directory or in a graphical interface; when the internal evaluation system is used, an internal evaluation index value for the clustering effect is also output.
The invention has the following beneficial effects: the method can perform clustering on a given data set of text files and, using no fewer than 4 conventional clustering algorithms, no fewer than 3 text distance measurement modes and 2 different clustering result evaluation modes, achieves a clustering time of no more than 10 minutes for text file data on the order of tens of thousands of documents and no more than 1 minute for text file data on the order of thousands of documents; meanwhile, the accuracy and efficiency of a clustering algorithm are judged through an internal or external clustering algorithm evaluation system, so that the best-performing clustering algorithm strategy can be selected.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a control chart for implementing the clustering algorithm of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
A method for rapidly clustering massive text data is suitable for execution on computer equipment. Command line parameters input through an external interface and text information read from a specified directory are preprocessed; a preset structure is then called through an internal interface to cluster the text data in the specified directory; the clustering result is output as an EXCEL file in the specified directory or in a graphical interface; and the clustering effect is evaluated.
The command line parameters comprise a clustering algorithm, a word vector encoding mode, a text distance measurement mode and an evaluation mode; the clustering algorithm comprises K-means, single-pass clustering, hierarchical clustering and density clustering; the word vector encoding mode comprises simhash encoding, word2vector and bert vector; the text distance measurement mode comprises Euclidean distance, Hamming distance, cosine distance and K-L divergence; the evaluation mode comprises internal evaluation and external evaluation.
When a user chooses to cluster documents, the user can input the selected command line parameters through the external interface. The defaults are: single-pass clustering for the clustering algorithm; simhash encoding for the word vector encoding mode; Euclidean distance for the text distance measurement mode; and internal evaluation for the evaluation mode.
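A command-line front end with these four parameters and the stated defaults might look like the following sketch. The option names (`--algorithm`, `--encoding`, `--distance`, `--evaluation`) and the choice strings are hypothetical; only the set of options and their defaults come from the text.

```python
import argparse

def build_parser():
    """Build a parser exposing the four parameters with the defaults described in the text."""
    parser = argparse.ArgumentParser(description="Cluster the text files in a directory")
    parser.add_argument("directory", help="directory containing the text files")
    parser.add_argument("--algorithm", default="single-pass",
                        choices=["kmeans", "single-pass", "hierarchical", "density"])
    parser.add_argument("--encoding", default="simhash",
                        choices=["simhash", "word2vector", "bert"])
    parser.add_argument("--distance", default="euclidean",
                        choices=["euclidean", "hamming", "cosine", "kl"])
    parser.add_argument("--evaluation", default="internal",
                        choices=["internal", "external"])
    return parser
```

Calling `build_parser().parse_args(["./docs"])` yields the defaults, while `build_parser().parse_args(["./docs", "--algorithm", "kmeans"])` overrides only the clustering algorithm.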
The simhash encoding judges the similarity of the clustered texts based on the text distance measurement mode; in the word2vector mode, word vectors are summed and averaged to obtain a sentence vector, and sentence vectors are summed and averaged to obtain a document vector; the bert vector mode works on the same basis as word2vector but provides a WEB service over HTTP through a WEB service layer at the DOCKER container level, so that requests from multiple tasks can be served efficiently.
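A client for such an HTTP embedding service could be sketched as follows. The URL, route, JSON payload shape and response shape are all hypothetical, since the text gives none of them; the sketch only shows the general pattern of sending texts to a containerized service and reading vectors back.

```python
import json
import urllib.request

# Hypothetical endpoint: the text does not give a host, port, route or payload format.
BERT_SERVICE_URL = "http://localhost:8080/encode"

def build_encode_request(texts, url=BERT_SERVICE_URL):
    """Build a POST request carrying the texts as a JSON body."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

def fetch_vectors(texts, url=BERT_SERVICE_URL, timeout=10.0):
    """Send the request and parse a response of the assumed form {"vectors": [[...], ...]}."""
    request = build_encode_request(texts, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["vectors"]
```

Keeping the model behind a service means the clustering process does not load the large model itself, and several clustering tasks can share one running container.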
The text clustering method shown in FIG. 1 includes the following steps:
reading text data: reading text information from the text files in a specified directory, that is, reading the text data files to be clustered in that directory;
preprocessing text information: completing word embedding through the word vector encoding mode to obtain word vectors; the main preprocessing comprises the following:
word segmentation (an English document is not segmented but must be tokenized);
removing stop words;
in the simhash encoding mode, directly calculating the simhash code of the document with stop words removed;
in the word2vector mode, performing word embedding and then directly calculating the vector of the document with stop words removed; specifically, the document vector is obtained by summing and averaging sentence vectors, and each sentence vector is obtained by summing and averaging word vectors;
in the BERT vector mode, performing word embedding on the same basic principle as the word2vector mode; however, because the BERT model is large, calling it directly in the way word2vector is called cannot serve requests from multiple tasks efficiently and simultaneously, so a container-level WEB service dedicated to the BERT model is built with DOCKER container technology, and a RESTful WEB service is provided over HTTP through the WEB service layer at the DOCKER container level;
clustering text data: clustering the preprocessed word vectors and evaluating the effect by combining at least one clustering algorithm with an internal or external clustering algorithm evaluation system; at least 4 clustering algorithm implementations are provided, based respectively on K-means, hierarchical, density and single-pass algorithms, together with 2 major classes of clustering algorithm evaluation system, namely an external evaluation system and an internal evaluation system;
outputting a clustering result: clustering result output in EXCEL format and clustering result output in a graphical interface are both provided; further, when the internal evaluation system is used, an internal evaluation index value for the clustering effect may be output.
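Of the four algorithm families listed, single-pass clustering is the default, and a common formulation of it is sketched below. The threshold parameter, the nearest-centroid assignment rule and the running-mean centroid update are assumptions of this sketch; the text does not describe the exact variant used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length float vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_pass(vectors, threshold, distance=euclidean):
    """Visit each vector once: join the nearest cluster if its centroid is
    within `threshold`, otherwise open a new cluster. Returns one label per vector."""
    centroids, sizes, labels = [], [], []
    for v in vectors:
        best, best_d = None, None
        for idx, c in enumerate(centroids):
            d = distance(v, c)
            if best_d is None or d < best_d:
                best, best_d = idx, d
        if best is not None and best_d <= threshold:
            n = sizes[best]
            # Update the centroid as a running mean of its members.
            centroids[best] = [(ci * n + vi) / (n + 1)
                               for ci, vi in zip(centroids[best], v)]
            sizes[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(v))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

Each document is compared only against the current centroids, so the cost grows with the number of clusters rather than the number of document pairs, which is one reason this family suits large collections. The result does depend on the order in which documents are visited.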
The word vectors are used for word embedding representation in the text information preprocessing stage, and all word vectors are 200-dimensional floating-point vectors.
The preset structure comprises a map structure for storing the clustering result and a list structure for storing intermediate clustering results.
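One way to read this division of labor is sketched below: a list accumulates (document, cluster) pairs while clustering runs, and a map from cluster identifier to member documents holds the final result. The layout of both structures and the name `collect_results` are assumptions; the text does not define their fields.

```python
from collections import defaultdict

def collect_results(doc_ids, labels):
    """Return (intermediate, result).

    intermediate: list of (doc_id, cluster_id) pairs in processing order
    result:       map (dict) from cluster_id to the doc_ids in that cluster
    """
    intermediate = []
    for doc_id, label in zip(doc_ids, labels):
        intermediate.append((doc_id, label))
    result = defaultdict(list)
    for doc_id, label in intermediate:
        result[label].append(doc_id)
    return intermediate, dict(result)
```

For example, `collect_results(["a.txt", "b.txt", "c.txt"], [0, 1, 0])` groups `a.txt` and `c.txt` under cluster 0 and `b.txt` under cluster 1.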
A text clustering system for realizing the text clustering method comprises a data extraction module, a clustering operation module and a clustering result output module; the clustering operation module comprises a preprocessing module, a clustering algorithm execution module and a clustering effect evaluation module.
The data extraction module reads text information from the text files in the specified directory and sends it to the preprocessing module;
the preprocessing module receives the command line parameters input through the external interface and the text information sent by the data extraction module, executes the text information preprocessing step, and sends the command line parameters and the preprocessed word vectors to the clustering algorithm execution module;
the clustering algorithm execution module calls the preset structure through an internal interface to cluster the text data in the specified directory; the clustering effect evaluation module evaluates the clustering effect through an internal or external clustering algorithm evaluation system;
the clustering result output module outputs the clustering result as an EXCEL file in the specified directory or in a graphical interface; when the internal evaluation system is used, an internal evaluation index value for the clustering effect is also output.
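The text does not name the internal evaluation index. As one widely used example of an internal index, which needs no ground-truth labels, the mean silhouette coefficient can be computed as in this sketch; choosing the silhouette here is an assumption, not something the text states.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length float vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_score(vectors, labels, distance=euclidean):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)
```

Values near 1 indicate compact, well-separated clusters and negative values indicate likely misassignments. The pairwise distance computation is quadratic in the number of documents, so on very large collections it would typically be run on a sample.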
When the clustering algorithm execution module executes normally, as shown in FIG. 2, the clustering effect evaluation module evaluates the clustering effect through an internal or external clustering algorithm evaluation system; when the text data set is too large and memory overflows, all internal data and state of the system are rolled back to the state before the error, and the error condition of the clustering algorithm is recorded.
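The rollback behavior can be sketched with a snapshot-and-restore wrapper. Representing the system's internal data as a single dictionary, and the names `run_with_rollback` and `cluster_fn`, are assumptions made for illustration.

```python
import copy
import logging

logger = logging.getLogger("clustering")

def run_with_rollback(state, algorithm_name, cluster_fn, vectors):
    """Run cluster_fn(state, vectors); on MemoryError restore `state` to its
    pre-run snapshot and record the failure. Returns True on success."""
    snapshot = copy.deepcopy(state)  # state before the clustering attempt
    try:
        cluster_fn(state, vectors)
        return True
    except MemoryError as exc:
        state.clear()
        state.update(snapshot)  # roll back any partial changes
        logger.error("clustering algorithm %s failed: %r", algorithm_name, exc)
        return False
```

The deep copy itself consumes memory, so a real system would likely snapshot only small bookkeeping state rather than the full vector data.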
The text clustering system further comprises a log management module that records error information, including the error time, error level, error cause and error location; the error location is displayed using recursive calls.
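A log record carrying these four fields can be sketched with the standard `logging` and `traceback` modules. Rendering the error location as the chain of nested calls that led to the failure is one possible reading of the text and is an assumption, as are the logger name and the message layout.

```python
import logging
import traceback

logger = logging.getLogger("clustering.errors")
_handler = logging.StreamHandler()
# The formatter supplies the error time and the error level.
_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(_handler)

def error_location(exc):
    """Render the call chain that led to the exception, outermost call first."""
    frames = traceback.extract_tb(exc.__traceback__)
    return " -> ".join(f"{frame.name}:{frame.lineno}" for frame in frames)

def log_error(exc, level=logging.ERROR):
    """Log the error cause and location; return the message that was logged."""
    message = f"reason={exc!r} location={error_location(exc)}"
    logger.log(level, message)
    return message
```

Called inside an `except` block, `log_error(exc)` emits one line containing the timestamp, the level name, the exception and the function chain with line numbers.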
The foregoing shows and describes the general principles, main features and advantages of the present invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and the description only illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method for rapidly clustering massive text data, characterized in that command line parameters input through an external interface and text information read from a specified directory are preprocessed, a preset structure is then called through an internal interface to cluster the text data in the specified directory, the clustering result is output as an EXCEL file in the specified directory or in a graphical interface, and the clustering effect is evaluated.
2. The method for rapidly clustering massive text data according to claim 1, wherein the command line parameters comprise a clustering algorithm, a word vector encoding mode, a text distance measurement mode and an evaluation mode;
the clustering algorithm comprises K-means, single-pass clustering, hierarchical clustering and density clustering;
the word vector encoding mode comprises simhash encoding, word2vector and bert vector;
the text distance measurement mode comprises Euclidean distance, Hamming distance, cosine distance and K-L divergence;
the evaluation mode comprises internal evaluation and external evaluation.
3. The method for rapidly clustering massive text data according to claim 2, wherein the clustering method comprises the following steps:
reading text data: reading text information from the text files in a specified directory;
preprocessing text information: completing word embedding through the word vector encoding mode to obtain word vectors;
clustering text data: clustering the preprocessed word vectors and evaluating the effect by combining at least one clustering algorithm with an internal or external clustering algorithm evaluation system;
outputting a clustering result: outputting the clustering result in EXCEL format or in a graphical interface.
4. The method for rapidly clustering massive text data according to claim 2, wherein the simhash encoding judges the similarity of the clustered texts based on the text distance measurement mode; in the word2vector mode, word vectors are summed and averaged to obtain a sentence vector, and sentence vectors are summed and averaged to obtain a document vector; the bert vector mode works on the same basis as word2vector but provides a WEB service over HTTP through a WEB service layer at the DOCKER container level, so that requests from multiple tasks can be served efficiently.
5. The method for rapidly clustering massive text data according to claim 2, wherein in the word vector encoding mode, all word vectors are 200-dimensional floating-point vectors.
6. The method for rapidly clustering massive text data according to claim 3 or 4, wherein the text information preprocessing step comprises the following sub-steps:
S1: performing word segmentation on Chinese documents and tokenization on English documents;
S2: removing stop words;
S3: calculating a simhash code of the document with stop words removed;
S4: performing word embedding in the word2vector mode and calculating a document vector of the document with stop words removed;
S5: performing word embedding in the bert vector mode to obtain word vectors.
7. The method for rapidly clustering massive text data according to claim 3 or 5, wherein the word vectors are used for word embedding representation in the text information preprocessing stage.
8. The method for rapidly clustering massive text data according to claim 1, wherein the structure comprises a map structure and a list structure; the list structure stores intermediate clustering results, and the map structure stores the clustering result.
9. A system for rapidly clustering massive text data, characterized by comprising a data extraction module, a clustering operation module and a clustering result output module; the clustering operation module comprises a preprocessing module, a clustering algorithm execution module and a clustering effect evaluation module.
10. The system for rapidly clustering massive text data according to claim 9, wherein the data extraction module reads text information from the text files in a specified directory and sends it to the preprocessing module;
the preprocessing module receives the command line parameters input through an external interface and the text information sent by the data extraction module, executes the text information preprocessing step, and sends the command line parameters and the preprocessed word vectors to the clustering algorithm execution module;
the clustering algorithm execution module calls a preset structure through an internal interface to cluster the text data in the specified directory; the clustering effect evaluation module evaluates the clustering effect through an internal or external clustering algorithm evaluation system;
the clustering result output module outputs the clustering result as an EXCEL file in the specified directory or in a graphical interface; when the internal evaluation system is used, an internal evaluation index value for the clustering effect is also output.
CN201911347726.1A 2019-12-24 2019-12-24 Method for rapidly clustering massive text data Pending CN111061880A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911347726.1A CN111061880A (en) | 2019-12-24 | 2019-12-24 | Method for rapidly clustering massive text data

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911347726.1A CN111061880A (en) | 2019-12-24 | 2019-12-24 | Method for rapidly clustering massive text data

Publications (1)

Publication Number | Publication Date
CN111061880A (en) | 2020-04-24

Family

ID=70303132

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN201911347726.1A CN111061880A (en) | Pending | 2019-12-24 | 2019-12-24 | Method for rapidly clustering massive text data

Country Status (1)

Country | Link
CN (1) | CN111061880A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113408266A (en) * | 2020-12-02 | 2021-09-17 | 腾讯科技(深圳)有限公司 | Text processing method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102929894A (en) * | 2011-08-12 | 2013-02-13 | 中国人民解放军总参谋部第五十七研究所 | Online clustering visualization method of text
CN104008092A (en) * | 2014-06-10 | 2014-08-27 | 复旦大学 | Method and system for relation characterization, clustering and identification based on semantic space mapping
CN106776713A (en) * | 2016-11-03 | 2017-05-31 | 中山大学 | A massive short text clustering method based on word vector semantic analysis
CN106951498A (en) * | 2017-03-15 | 2017-07-14 | 国信优易数据有限公司 | Text clustering method
CN108647319A (en) * | 2018-05-10 | 2018-10-12 | 思派(北京)网络科技有限公司 | A labeling system and method based on short text clustering
CN109558482A (en) * | 2018-07-27 | 2019-04-02 | 中山大学 | A parallel method for the text clustering model PW-LDA based on the Spark framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200424)