CN113378950A - Unsupervised classification method for long texts - Google Patents

Publication number
CN113378950A
Authority
CN
China
Prior art keywords
text
long
extracting
word
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110691284.3A
Other languages
Chinese (zh)
Inventor
林正春
兰林
陈功文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chace Network Information Technology Co ltd
Original Assignee
Shenzhen Chace Network Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Chace Network Information Technology Co ltd
Priority to CN202110691284.3A
Publication of CN113378950A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an unsupervised classification method for long texts, comprising the following steps: filtering the long text to be classified and extracting three parts from it: the title text, the body text, and the issuing-department text; extracting a weight coefficient for each of the three parts; fusing the title text, the body text, and the issuing-department text into a new long text T according to the extracted weight coefficients; performing Chinese word segmentation on the new long text T and extracting word segmentation information; inputting the word segmentation information into a word vector model to obtain word vector information; calculating a feature vector of the long text T from the word vector information; and clustering the feature vectors of the long text T to obtain the text classification. The invention improves the long-text classification method, reduces the time complexity of long-text classification, improves its accuracy, and makes long texts easier for users to read and classify.

Description

Unsupervised classification method for long texts
Technical Field
The invention relates to the technical field of network information, in particular to an unsupervised long text classification method.
Background
To support the development of enterprises, the state and government have issued a variety of texts that enterprises need for their development. Guided by these texts, an enterprise can understand government guidance more directly and accurately and, to a large extent, understand the market, so that it can produce products that better meet market demand. Government preferential-support texts cover many areas, such as department recruitment, tax deductions, financing support, environment optimization, and talent recruitment and introduction, and they directly or indirectly promote the healthy development of enterprises. These national and government texts have become an important basis for planning enterprise development.
Accurately reading and classifying texts, particularly long texts, has become an important problem that urgently needs to be solved in enterprise development. Because long texts are lengthy and differ from ordinary texts in structure and word usage, classifying them has high time complexity and low accuracy. Research on long-text classification is currently relatively scarce.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing an unsupervised long-text classification method that reduces the time complexity of long-text classification, improves its accuracy, and makes long texts easier to read and classify.
The purpose of the invention is achieved by the following technical scheme:
An unsupervised classification method for long texts comprises the following steps:
(1) filtering the long text to be classified, and extracting three parts from it: the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$;
(2) extracting the weight coefficients $c_1$, $c_2$, $c_3$ of the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$;
(3) fusing the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$ into a new long text T according to the extracted weight coefficients;
(4) performing Chinese word segmentation on the new long text T, and extracting word segmentation information; inputting the word segmentation information into a word vector model to obtain word vector information;
(5) calculating a feature vector of the long text T according to the word vector information;
(6) clustering the feature vectors of the long text T to obtain a text classification.
Further, the step (2) is specifically as follows:
2.1, preprocessing the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$, and selecting the feature words of each text using the $\chi^2$ statistical method;
2.2, extracting the weight coefficient of each text from its feature words, using TF-IDF and a feature evaluation function based on the $\chi^2$ statistical method.
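The source does not state how TF-IDF and the $\chi^2$-based evaluation function are combined into a single weight coefficient. As a minimal sketch of the TF-IDF ingredient alone, the following computes a per-term score for already segmented documents. The function name `tf_idf`, the sample tokens, and the unsmoothed `log(N / df)` form of the inverse document frequency are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Term frequency times (unsmoothed) inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical, already segmented sample documents.
docs = [["tax", "relief", "tax"], ["tax", "financing"], ["talent", "recruitment"]]
weights = tf_idf(docs)
```

A term that appears in every document scores zero under this form, which is why rarer feature words dominate the resulting weights.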
Further, the step (3) is specifically as follows:
3.1, calculating the length $L_i$ of each text $t_i$, and taking $L = \max(L_1, L_2, L_3)$ and $c = \max(c_1, c_2, c_3)$, where $i = 1, 2, 3$;
3.2, replicating each text $t_i$ $\frac{L}{L_i} \cdot \frac{c_i}{c}$ times and concatenating the copies to obtain the text $T_i$;
3.3, concatenating the texts $T_1$, $T_2$, $T_3$ in sequence to obtain the new long text T.
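The fusion in steps 3.1 to 3.3 can be sketched as follows, taking the copy count in step 3.2 to be (L / L_i) * (c_i / c). The source does not say how a fractional copy count is handled or in what unit length is measured; rounding to the nearest integer with a minimum of one copy, and measuring length in characters, are assumptions of this sketch, as is the function name `fuse_parts`.

```python
def fuse_parts(parts, weights):
    """Fuse text parts t_i into one text T by weighted replication.

    parts   -- [t1, t2, t3]: title, body, issuing-department text
    weights -- [c1, c2, c3]: weight coefficients from step (2)
    """
    lengths = [len(p) for p in parts]  # assumption: length in characters
    L = max(lengths)
    c = max(weights)
    fused = []
    for text, L_i, c_i in zip(parts, lengths, weights):
        if L_i == 0:
            continue  # assumption: skip empty parts to avoid dividing by zero
        # Assumption: round the (possibly fractional) copy count, minimum 1.
        copies = max(1, round(L / L_i * c_i / c))
        fused.append(text * copies)  # T_i: copies of t_i joined together
    return "".join(fused)  # T: T_1, T_2, T_3 in sequence

# Hypothetical inputs: a short title with the highest weight is repeated most.
T = fuse_parts(["ab", "abcdefgh", "xy"], [1.0, 0.5, 0.25])
```

The effect is that a short but highly weighted part, such as the title, contributes about as many tokens to T as the much longer body.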
Further, in the step (4), the open-source Chinese word segmentation tool HanLP is used to perform Chinese word segmentation on the new long text T, remove stop words, and extract the word segmentation information.
Further, in the step (4), the word vector model is a CBOW model constructed with the word2vec module of the gensim package.
Further, the step (5) is specifically as follows:
5.1, sorting the vectors in the word vector information in ascending order and taking the first N vectors, where N is less than the total number of vectors in the word vector information;
5.2, calculating the center of the first N vectors to obtain the feature vector of the long text T.
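Steps 5.1 and 5.2 can be illustrated with a short sketch. The source does not name the key by which the vectors are sorted; sorting by Euclidean norm is purely an assumption here, and the "center" is taken to be the component-wise mean. The function name `text_feature_vector` and the sample vectors are hypothetical.

```python
import math

def text_feature_vector(vectors, n):
    """Center (mean) of the first n vectors after an ascending sort."""
    if not 0 < n < len(vectors):
        raise ValueError("n must be positive and smaller than the number of vectors")
    # Assumption: the sort key is the Euclidean norm; the source does not name one.
    ranked = sorted(vectors, key=lambda v: math.sqrt(sum(x * x for x in v)))
    top = ranked[:n]
    dim = len(top[0])
    # Component-wise mean of the selected vectors.
    return [sum(v[d] for v in top) / n for d in range(dim)]

# Hypothetical 2-dimensional word vectors.
vectors = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
feature = text_feature_vector(vectors, 2)
```

Because only N vectors enter the mean, the cost of building the feature vector does not grow with the full vocabulary of the text once the sort is done.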
The invention has the following beneficial effects. The weight coefficients of the title text, the body text, and the issuing-department text are extracted, and the three texts are fused into a new long text according to these coefficients. Because each weight coefficient is derived from the feature words of its text, the fused long text contains feature word pairs and the association information between feature words, which improves the accuracy of text classification. Chinese word segmentation of the new long text extracts the word segmentation information and filters out irrelevant information; inputting this information into a word vector model yields word vector information that measures the semantic similarity of sentences, so that the similarity between texts can be calculated accurately and similar texts can be grouped. The feature vector of the long text T is calculated from the word vector information, and the feature vectors are clustered to obtain the text classification. Together, the weight coefficients and feature vectors improve the precision and effectiveness of the long-text classification method.
Drawings
Fig. 1 is a schematic structural diagram of an unsupervised long text classification method according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
An unsupervised classification method of long texts comprises the following steps:
(1) A long text is lengthy, and its structure mainly comprises three parts: a title, a body, and an issuing department. Before classification, the long text to be classified is therefore filtered, and three parts are extracted from it: the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$.
(2) The weight coefficients $c_1$, $c_2$, $c_3$ of the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$ are then extracted. A weight coefficient is an index of how important each text is within the text to be measured.
In the preferred technical scheme, the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$ are preprocessed, and the feature words of each text are selected using the $\chi^2$ statistical method; the weight coefficient of each text is then extracted from its feature words using TF-IDF and a feature evaluation function based on the $\chi^2$ statistical method. Extracting the weight coefficients from the feature words captures the association information between feature words, which improves the accuracy of text classification.
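For readers unfamiliar with the $\chi^2$ statistic used for feature-word selection, the following is a minimal sketch of its usual 2x2 contingency-table form. The source does not define which categories the statistic is computed against, so the meaning of "category" here, the function name `chi_square`, and the sample counts are all assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/category 2x2 contingency table.

    a -- documents in the category that contain the term
    b -- documents outside the category that contain the term
    c -- documents in the category without the term
    d -- documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - c * b) ** 2 / denom

# Hypothetical counts: the term is strongly associated with the category.
strong = chi_square(40, 10, 10, 40)
# Hypothetical counts: the term is independent of the category.
independent = chi_square(25, 25, 25, 25)
```

Terms with a high score are more strongly tied to one category than chance would predict, which is what makes them candidates for feature words.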
(3) The title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$ are fused into a new long text T according to the weight coefficients. The new long text contains the feature word pairs and the association information between feature words, which effectively improves the accuracy of text classification.
In the preferred technical scheme, the length $L_i$ of each text $t_i$ is calculated, taking $L = \max(L_1, L_2, L_3)$ and $c = \max(c_1, c_2, c_3)$, where $i = 1, 2, 3$; each text $t_i$ is replicated $\frac{L}{L_i} \cdot \frac{c_i}{c}$ times and the copies are concatenated to obtain the text $T_i$; the texts $T_1$, $T_2$, $T_3$ are then concatenated in sequence to obtain the new long text T. Adjusting the lengths of the extracted texts in this way fuses the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$ into a new long text T that is easier to interpret and classify.
(4) Chinese word segmentation is performed on the new long text T to extract the word segmentation information and filter out irrelevant information. The word segmentation information is input into a word vector model to obtain word vector information, which is used to measure the semantic similarity of sentences, so that the similarity between texts can be calculated accurately and similar texts can be grouped.
In the preferred technical scheme, the open-source Chinese word segmentation tool HanLP is used to perform Chinese word segmentation on the new long text T, remove stop words, and extract the word segmentation information. HanLP is a toolkit consisting of a series of models and algorithms that provides lexical analysis, syntactic analysis, text analysis, sentiment analysis, and other functions. It is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora, and customizability. Its bottom layer uses a series of high-speed data structures: the word segmentation rate reaches 20 million characters per second while using only about 120 MB of memory. In terms of IO, dictionaries load very quickly and the tool starts in only 500 ms. Its efficiency and applicability make it well suited to Chinese word segmentation of long texts.
In the preferred technical scheme, the word vector model is a CBOW model constructed with the word2vec module of the gensim package; the model is simple to build and makes word vectorization convenient. In the CBOW model, the center word is predicted from the surrounding words, and the vectors of the surrounding words are adjusted continuously by gradient descent according to how well the center word is predicted. After the word segmentation information is input into the CBOW model, each word serves in turn as the center word, and the model adjusts the surrounding words jointly: the gradient obtained is applied to the word vectors of the surrounding words, yielding word vectors for all words in the whole text. The number of predictions the CBOW model makes is almost equal to the number of words in the whole text, so it is efficient and fast.
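To make the CBOW training signal concrete, the sketch below enumerates the (surrounding words, center word) pairs that a CBOW model would be asked to predict for one segmented text. It does not train any vectors; in gensim, CBOW is selected by passing `sg=0` to `Word2Vec`. The function name `cbow_pairs`, the window size, and the sample tokens are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, center_word) pairs as a CBOW model sees them."""
    pairs = []
    for i, center in enumerate(tokens):
        # Words up to `window` positions to the left and right of the center.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, center))
    return pairs

# Hypothetical segmented text.
tokens = ["government", "supports", "enterprise", "development"]
pairs = cbow_pairs(tokens, window=1)
```

Each word yields exactly one prediction, so the number of pairs matches the number of words, consistent with the efficiency argument above.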
(5) The feature vector of the long text T is calculated from the word vector information.
In the preferred technical scheme, the vectors in the word vector information are sorted in ascending order and the first N vectors are taken, where N is less than the total number of vectors in the word vector information; the center of the first N vectors is then calculated to obtain the feature vector of the long text T.
(6) The feature vectors of the long text T are clustered to obtain the text classification, with higher accuracy and lower time complexity for long-text classification.
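The source does not name the clustering algorithm used in step (6). As one possible illustration, the sketch below applies plain k-means to a handful of feature vectors. The choice of k-means, the deterministic seeding with the first k points, the function name `kmeans`, and the sample vectors are all assumptions rather than the method of the source.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    # Assumption: the first k points seed the centroids (deterministic, not optimal).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: sum((x - y) ** 2 for x, y in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical feature vectors of four long texts forming two obvious groups.
features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels = kmeans(features, k=2)
```

Texts that share a label form one class; no labeled training data is needed, which is what makes the classification unsupervised.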
The invention addresses the high time complexity and low classification accuracy that arise because long texts are lengthy and differ from ordinary texts in structure and word usage. By improving the long-text classification method, it raises the method's precision and effectiveness and makes long texts easier to read and classify.
It should be understood that the above-described embodiments are merely preferred examples of the present invention and the technical principles applied thereto, and any changes, modifications, substitutions, combinations and simplifications made by those skilled in the art without departing from the spirit and principle of the present invention shall be covered by the protection scope of the present invention.

Claims (6)

1. An unsupervised classification method of long texts is characterized by comprising the following steps:
(1) filtering the long text to be classified, and extracting three parts from it: the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$;
(2) extracting the weight coefficients $c_1$, $c_2$, $c_3$ of the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$;
(3) fusing the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$ into a new long text T according to the extracted weight coefficients;
(4) performing Chinese word segmentation on the new long text T, and extracting word segmentation information; inputting the word segmentation information into a word vector model to obtain word vector information;
(5) calculating a feature vector of the long text T according to the word vector information;
(6) clustering the feature vectors of the long text T to obtain a text classification.
2. The unsupervised classification method for long texts according to claim 1, wherein the step (2) is specifically as follows:
2.1, preprocessing the title text $t_1$, the body text $t_2$, and the issuing-department text $t_3$, and selecting the feature words of each text using the $\chi^2$ statistical method;
2.2, extracting the weight coefficient of each text from its feature words, using TF-IDF and a feature evaluation function based on the $\chi^2$ statistical method.
3. The unsupervised classification method for long texts according to claim 1, wherein the step (3) is specifically as follows:
3.1, calculating the length $L_i$ of each text $t_i$, and taking $L = \max(L_1, L_2, L_3)$ and $c = \max(c_1, c_2, c_3)$, where $i = 1, 2, 3$;
3.2, replicating each text $t_i$ $\frac{L}{L_i} \cdot \frac{c_i}{c}$ times and concatenating the copies to obtain the text $T_i$;
3.3, concatenating the texts $T_1$, $T_2$, $T_3$ in sequence to obtain the new long text T.
4. The unsupervised classification method for long texts according to claim 1, wherein in the step (4), the open-source Chinese word segmentation tool HanLP is used to perform Chinese word segmentation on the new long text T, remove stop words, and extract the word segmentation information.
5. The unsupervised classification method for long texts according to claim 1, wherein in the step (4), the word vector model is a CBOW model and is constructed by word2vec module of gensim package.
6. The unsupervised classification method for long texts according to claim 1, wherein the step (5) is specifically as follows:
5.1, sorting the vectors in the word vector information in ascending order and taking the first N vectors, where N is less than the total number of vectors in the word vector information;
5.2, calculating the center of the first N vectors to obtain the feature vector of the long text T.
CN202110691284.3A 2021-06-22 2021-06-22 Unsupervised classification method for long texts Pending CN113378950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110691284.3A CN113378950A (en) 2021-06-22 2021-06-22 Unsupervised classification method for long texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110691284.3A CN113378950A (en) 2021-06-22 2021-06-22 Unsupervised classification method for long texts

Publications (1)

Publication Number Publication Date
CN113378950A (en) 2021-09-10

Family

ID=77578220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110691284.3A Pending CN113378950A (en) 2021-06-22 2021-06-22 Unsupervised classification method for long texts

Country Status (1)

Country Link
CN (1) CN113378950A (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method
WO2017190527A1 (en) * 2016-05-06 2017-11-09 华为技术有限公司 Text data classification method and server
CN109522408A (en) * 2018-10-30 2019-03-26 广东原昇信息科技有限公司 The classification method of information streaming material intention text
US20190095432A1 (en) * 2017-09-26 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for building text classification model, and text classification method and apparatus
CN110287314A (en) * 2019-05-20 2019-09-27 中国科学院计算技术研究所 Long text credibility evaluation method and system based on Unsupervised clustering
WO2019214133A1 (en) * 2018-05-08 2019-11-14 华南理工大学 Method for automatically categorizing large-scale customer complaint data
CN111324728A (en) * 2020-01-22 2020-06-23 腾讯科技(深圳)有限公司 Text event abstract generation method and device, electronic equipment and storage medium
CN111694958A (en) * 2020-06-05 2020-09-22 深兰人工智能芯片研究院(江苏)有限公司 Microblog topic clustering method based on word vector and single-pass fusion
CN111767741A (en) * 2020-06-30 2020-10-13 福建农林大学 Text emotion analysis method based on deep learning and TFIDF algorithm
CN112287105A (en) * 2020-09-30 2021-01-29 昆明理工大学 Method for analyzing correlation of law-related news fusing bidirectional mutual attention of title and text
CN112347255A (en) * 2020-11-06 2021-02-09 天津大学 Text classification method based on title and text combination of graph network
CN112364124A (en) * 2020-11-19 2021-02-12 湖南红网新媒体集团有限公司 Text similarity matching and calculating method, system and device
CN112507114A (en) * 2020-11-04 2021-03-16 福州大学 Multi-input LSTM-CNN text classification method and system based on word attention mechanism
EP3816844A1 (en) * 2019-10-29 2021-05-05 Robert Bosch GmbH Computer-implemented method and device for processing data
CN112860898A (en) * 2021-03-16 2021-05-28 哈尔滨工业大学(威海) Short text box clustering method, system, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
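但宇豪; 黄继风; 杨琳; 高海: "Research on dialogue-line text classification based on TF-IDF and word2vec", Journal of Shanghai Normal University (Natural Sciences), no. 01, pages 89-95 *
王启发; 王中卿; 李寿山; 周国栋: "Comment sentiment classification based on a cross-attention mechanism and news body text", Computer Science, no. 10, pages 222-227 *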

Similar Documents

Publication Publication Date Title
CN111767741B (en) Text emotion analysis method based on deep learning and TFIDF algorithm
CN109213861B (en) Traveling evaluation emotion classification method combining At _ GRU neural network and emotion dictionary
CN109829159B (en) Integrated automatic lexical analysis method and system for ancient Chinese text
RU2665239C2 (en) Named entities from the text automatic extraction
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN111581474A (en) Evaluation object extraction method of case-related microblog comments based on multi-head attention system
CN109086340A (en) Evaluation object recognition methods based on semantic feature
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN113051887A (en) Method, system and device for extracting announcement information elements
CN114722835A (en) Text emotion recognition method based on LDA and BERT fusion improved model
Chen et al. Chinese Weibo sentiment analysis based on character embedding with dual-channel convolutional neural network
CN114064901A (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
CN115934936A (en) Intelligent traffic text analysis method based on natural language processing
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN113378950A (en) Unsupervised classification method for long texts
CN115630140A (en) English reading material difficulty judgment method based on text feature fusion
CN113095087B (en) Chinese word sense disambiguation method based on graph convolution neural network
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator
CN111859910B (en) Word feature representation method for semantic role recognition and fusing position information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination