CN110321434A - A text classification method based on word sense disambiguation and convolutional neural networks

Info

Publication number
CN110321434A
Authority
CN
China
Prior art keywords
word
text
criticality
convolutional neural
neural networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910565070.4A
Other languages
Chinese (zh)
Inventor
肖清林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central Mdt Infotech Ltd Of United States Of Xiamen
Original Assignee
Central Mdt Infotech Ltd Of United States Of Xiamen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central Mdt Infotech Ltd Of United States Of Xiamen
Priority to CN201910565070.4A
Publication of CN110321434A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A text classification method based on word sense disambiguation and convolutional neural networks, comprising the following steps: configuring an ambiguity dictionary whose word senses have been determined; obtaining the relevant files, extracting the text content from the files, and performing word segmentation on each sentence in the text; determining the part of speech of each word in a sentence; determining the target words to disambiguate; determining the sense of each target word and performing disambiguation; performing word segmentation and stop-word removal on the original sentences contained in the disambiguated text to obtain target sentences corresponding to the original sentences; determining the criticality of the words in each target sentence; determining the criticality of each target sentence; sorting the sentences by criticality to obtain the target text; and classifying the target text using a trained text classification model based on a convolutional neural network. The invention performs text classification based on word sense disambiguation and convolutional neural networks, optimizes the text classification method, improves the efficiency and accuracy of text classification, and saves time and labor.

Description

A text classification method based on word sense disambiguation and convolutional neural networks
Technical field
The present invention relates to the field of text classification technology, and more particularly to a text classification method based on word sense disambiguation and convolutional neural networks.
Background
With the growth of online media and the steadily increasing number of internet users, a large amount of text data is constantly being generated. How to process this huge volume of text data and classify it correctly is an urgent problem. Text classification trains a classifier on existing data and applies the classifier to test documents to determine the category of each document. Correct text classification lets users find the information they need faster and browse documents more easily. Automatic text classification means training a text classifier on class-labeled training text and then using that classifier to identify test text of unknown category.
In the prior art, text classification methods mainly include the following. Rule-based methods: rules are formulated from statistics over a large number of text features and from domain knowledge, and text is classified by those rules; this approach requires a great deal of time and relevant professionals. Methods based on vector space representation: features are first selected and extracted, the text representation is built in a vector space, and a classifier is then constructed; this approach ignores the semantic information of words, and its dimensionality is large, which easily leads to the curse of dimensionality. Methods based on distributed word vectors: features are first selected and extracted, the text representation is built by methods such as LDA or Word2Vec, and a classifier is then constructed; this approach captures only one of global information or local information and ignores the other, so classification accuracy is low.
Current text classification methods are therefore complex, inefficient, and of low accuracy.
Summary of the invention
(1) Object of the invention
To solve the technical problems described in the background, the present invention proposes a text classification method based on word sense disambiguation and convolutional neural networks. The method performs text classification based on word sense disambiguation and convolutional neural networks, optimizes the text classification method, improves the efficiency and accuracy of text classification, and saves time and labor.
(2) Technical solution
To solve the above problems, the invention proposes a text classification method based on word sense disambiguation and convolutional neural networks, comprising the following steps:
S1, configure an ambiguity dictionary whose word senses have been determined;
S2, obtain the relevant files, extract the text content from the files, and perform word segmentation on each sentence in the text;
S3, tag each sentence in the text with parts of speech to determine the part of speech of each word in the sentence;
S4, based on the ambiguity dictionary, determine the target words to disambiguate;
S5, based on syntactic analysis of the sentence and analysis of contextual information, determine the sense of each target word and perform disambiguation;
S6, perform word segmentation and stop-word removal on the original sentences contained in the disambiguated text to obtain target sentences corresponding to the original sentences;
S7, determine the criticality of the words in each target sentence;
S8, determine the criticality of each target sentence from the criticality of its words;
S9, sort the sentences by their criticality to obtain the target text;
S10, classify the target text using a trained text classification model based on a convolutional neural network.
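Steps S4 and S5 above can be illustrated with a minimal sketch. The patent does not specify the format of the ambiguity dictionary or the disambiguation algorithm, so everything here is an assumption: a hypothetical `AMBIGUITY_DICT` maps each ambiguous word to candidate senses with cue words, and the sense is chosen by counting overlap with the surrounding context. This simplified Lesk-style heuristic only stands in for the syntactic and contextual analysis named in S5.

```python
# Hypothetical ambiguity dictionary (S1): word -> {sense label: cue words}.
# The format and the entries are illustrative assumptions, not from the patent.
AMBIGUITY_DICT = {
    "bank": {
        "bank_finance": {"money", "loan", "account", "deposit"},
        "bank_river": {"river", "water", "shore", "fishing"},
    },
}


def find_targets(tokens):
    """S4: return positions of tokens that appear in the ambiguity dictionary."""
    return [i for i, tok in enumerate(tokens) if tok in AMBIGUITY_DICT]


def disambiguate(tokens, window=5):
    """S5 (simplified): replace each target word with its best-matching sense label.

    The sense whose cue words overlap most with the surrounding context wins;
    ties and zero overlap fall back to the first listed sense.
    """
    result = list(tokens)
    for i in find_targets(tokens):
        context = set(tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window])
        senses = AMBIGUITY_DICT[tokens[i]]
        result[i] = max(senses, key=lambda s: len(senses[s] & context))
    return result
```

Replacing the ambiguous word with a sense label is one possible way to carry the result of disambiguation into the later steps; a real system might instead attach the sense as an annotation.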
Preferably, in S2, the ways of obtaining files include web crawling, online downloading, and batch import.
Preferably, in S2 and S6, word segmentation is performed with the jieba tool.
Preferably, in S6, stop-word removal is performed with the stopwords tool.
Preferably, in S6, each target sentence contains at least one word.
Preferably, in S7, the criticality of a word indicates how relevant the word is to the topic that the text to be classified is meant to express.
Preferably, S7 comprises the following steps:
S71, determine the word vector of each word in the target sentence using a pre-trained first word vector model;
S72, determine the topic vector of each word in the target sentence using a pre-trained topic vector model;
S73, determine the topic probability distribution of the text to be classified using a pre-trained first topic model;
S74, determine the criticality of each word from its word vector, its topic vector, and the topic probability distribution.
Preferably, S74 comprises the following steps:
S741, determine a first similarity value between the word vector of the word and the topic vector of the word according to a preset similarity calculation method;
S742, determine the criticality of the word from the first similarity value and the topic probability distribution.
Preferably, in S8, the criticality of the word with the highest criticality in the target sentence is taken as the criticality of the target sentence.
The above technical solution of the invention has the following beneficial technical effects:
The invention performs text classification based on word sense disambiguation and convolutional neural networks, optimizes the text classification method, improves the efficiency and accuracy of text classification, and saves time and labor.
Brief description of the drawings
Fig. 1 is a flow chart of the text classification method based on word sense disambiguation and convolutional neural networks proposed by the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawing. It should be understood that these descriptions are merely illustrative and are not intended to limit the scope of the invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the invention.
As shown in Fig. 1, the text classification method based on word sense disambiguation and convolutional neural networks proposed by the present invention comprises the following steps:
S1, configure an ambiguity dictionary whose word senses have been determined;
S2, obtain the relevant files, extract the text content from the files, and perform word segmentation on each sentence in the text;
S3, tag each sentence in the text with parts of speech to determine the part of speech of each word in the sentence;
S4, based on the ambiguity dictionary, determine the target words to disambiguate;
S5, based on syntactic analysis of the sentence and analysis of contextual information, determine the sense of each target word and perform disambiguation;
S6, perform word segmentation and stop-word removal on the original sentences contained in the disambiguated text to obtain target sentences corresponding to the original sentences;
S7, determine the criticality of the words in each target sentence;
S8, determine the criticality of each target sentence from the criticality of its words;
S9, sort the sentences by their criticality to obtain the target text;
S10, classify the target text using a trained text classification model based on a convolutional neural network.
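The patent does not describe the architecture of the convolutional model used in S10. As a hedged illustration only, the sketch below assumes a common text-CNN layout: filters slide over windows of consecutive word vectors, each filter's activations pass through ReLU and max-over-time pooling, and a linear layer with softmax produces class probabilities. The function names and the toy weights are hypothetical, and no training is shown.

```python
import math


def conv1d_max_pool(embeddings, filt, bias=0.0):
    """Slide one filter over consecutive word windows, apply ReLU, then max-pool.

    embeddings: list of word vectors (one per word of the target text).
    filt: list of `width` weight vectors, each the same length as a word vector.
    """
    width = len(filt)
    activations = []
    for start in range(len(embeddings) - width + 1):
        total = bias
        for k in range(width):
            total += sum(w * x for w, x in zip(filt[k], embeddings[start + k]))
        activations.append(max(0.0, total))  # ReLU
    return max(activations) if activations else 0.0  # max-over-time pooling


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embeddings, filters, class_weights):
    """Return class probabilities for one text.

    filters: list of filters (one pooled feature per filter).
    class_weights: one weight vector per class, of length len(filters).
    """
    features = [conv1d_max_pool(embeddings, f) for f in filters]
    logits = [sum(w * x for w, x in zip(row, features)) for row in class_weights]
    return softmax(logits)
```

In practice the filters and class weights would be learned from labeled target texts with a deep learning framework; the pure-Python forward pass is meant only to show the data flow from word vectors to class probabilities.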
In an alternative embodiment, in S2, the ways of obtaining files include web crawling, online downloading, and batch import.
In an alternative embodiment, in S2 and S6, word segmentation is performed with the jieba tool.
In an alternative embodiment, in S6, stop-word removal is performed with the stopwords tool.
In an alternative embodiment, in S6, each target sentence contains at least one word.
In an alternative embodiment, in S7, the criticality of a word indicates how relevant the word is to the topic that the text to be classified is meant to express.
In an alternative embodiment, S7 comprises the following steps:
S71, determine the word vector of each word in the target sentence using a pre-trained first word vector model;
S72, determine the topic vector of each word in the target sentence using a pre-trained topic vector model;
S73, determine the topic probability distribution of the text to be classified using a pre-trained first topic model;
S74, determine the criticality of each word from its word vector, its topic vector, and the topic probability distribution.
In an alternative embodiment, S74 comprises the following steps:
S741, determine a first similarity value between the word vector of the word and the topic vector of the word according to a preset similarity calculation method;
S742, determine the criticality of the word from the first similarity value and the topic probability distribution.
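The patent names neither the similarity method of S741 nor the formula that combines the similarity with the topic probability distribution in S742. The sketch below is one plausible reading, offered as an assumption: cosine similarity serves as the preset similarity method, one vector is available per topic, and the criticality of a word is the topic-probability-weighted sum of its similarities. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_criticality(word_vec, topic_vecs, topic_probs):
    """Assumed combination for S741-S742: topic-probability-weighted similarity.

    word_vec: word vector from the word vector model (S71).
    topic_vecs: one vector per topic from the topic vector model (S72).
    topic_probs: topic probability distribution of the text (S73).
    """
    return sum(p * cosine(word_vec, t) for p, t in zip(topic_probs, topic_vecs))
```

Under this reading, a word scores highly when it lies close to the topics that dominate the text, which matches the stated purpose of criticality as relevance to the topic of the text to be classified.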
In an alternative embodiment, in S8, the criticality of the word with the highest criticality in the target sentence is taken as the criticality of the target sentence.
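Steps S8 and S9 reduce to a maximum and a sort. The sketch below assumes that the sentences are ordered from highest to lowest criticality, a direction the patent does not state explicitly, and that each sentence arrives with a non-empty list of word criticality scores. The function names are hypothetical.

```python
def sentence_criticality(word_scores):
    """S8: criticality of a sentence = highest criticality among its words."""
    return max(word_scores)


def build_target_text(sentences, scores_per_sentence):
    """S9: order sentences by criticality, highest first (stable for ties)."""
    ranked = sorted(
        zip(sentences, scores_per_sentence),
        key=lambda pair: sentence_criticality(pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked]
```

For example, `build_target_text(["a", "b", "c"], [[0.1, 0.3], [0.9], [0.2, 0.5]])` places sentence `"b"` first because its best word scores 0.9.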
In the present invention, an ambiguity dictionary whose word senses have been determined is first configured. The relevant files are then obtained, the text content is extracted from the files, and word segmentation is performed on each sentence in the text. Next, each sentence in the text is tagged with parts of speech to determine the part of speech of each word, and the target words to disambiguate are determined based on the ambiguity dictionary. Based on syntactic analysis of the sentence and analysis of contextual information, the sense of each target word is determined and disambiguation is performed. Word segmentation and stop-word removal are then performed on the original sentences contained in the disambiguated text to obtain target sentences corresponding to the original sentences. The criticality of the words in each target sentence is determined, and the criticality of the target sentence follows from it: the criticality of the word with the highest criticality in the target sentence is the criticality of the target sentence. The sentences are then sorted by criticality to obtain the target text. Finally, the target text is classified using a trained text classification model based on a convolutional neural network.
The criticality of the words in a target sentence is determined as follows: the word vector of each word in the target sentence is determined using a pre-trained first word vector model; the topic vector of each word in the target sentence is determined using a pre-trained topic vector model; the topic probability distribution of the text to be classified is determined using a pre-trained first topic model; a first similarity value between the word vector of the word and the topic vector of the word is determined according to a preset similarity calculation method; and the criticality of the word is determined from the first similarity value and the topic probability distribution.
The invention performs text classification based on word sense disambiguation and convolutional neural networks, optimizes the text classification method, improves the efficiency and accuracy of text classification, and saves time and labor.
It should be understood that the above specific embodiments of the invention are used only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or the equivalents of such scope and boundaries.

Claims (9)

1. A text classification method based on word sense disambiguation and convolutional neural networks, characterized by comprising the following steps:
S1, configure an ambiguity dictionary whose word senses have been determined;
S2, obtain the relevant files, extract the text content from the files, and perform word segmentation on each sentence in the text;
S3, tag each sentence in the text with parts of speech to determine the part of speech of each word in the sentence;
S4, based on the ambiguity dictionary, determine the target words to disambiguate;
S5, based on syntactic analysis of the sentence and analysis of contextual information, determine the sense of each target word and perform disambiguation;
S6, perform word segmentation and stop-word removal on the original sentences contained in the disambiguated text to obtain target sentences corresponding to the original sentences;
S7, determine the criticality of the words in each target sentence;
S8, determine the criticality of each target sentence from the criticality of its words;
S9, sort the sentences by their criticality to obtain the target text;
S10, classify the target text using a trained text classification model based on a convolutional neural network.
2. The text classification method based on word sense disambiguation and convolutional neural networks according to claim 1, characterized in that, in S2, the ways of obtaining files include web crawling, online downloading, and batch import.
3. The text classification method based on word sense disambiguation and convolutional neural networks according to claim 1, characterized in that, in S2 and S6, word segmentation is performed with the jieba tool.
4. The text classification method based on word sense disambiguation and convolutional neural networks according to claim 1, characterized in that, in S6, stop-word removal is performed with the stopwords tool.
5. The text classification method based on word sense disambiguation and convolutional neural networks according to claim 1, characterized in that, in S6, each target sentence contains at least one word.
6. The text classification method based on word sense disambiguation and convolutional neural networks according to claim 1, characterized in that, in S7, the criticality of a word indicates how relevant the word is to the topic that the text to be classified is meant to express.
7. The text classification method based on word sense disambiguation and convolutional neural networks according to claim 1, characterized in that S7 comprises the following steps:
S71, determine the word vector of each word in the target sentence using a pre-trained first word vector model;
S72, determine the topic vector of each word in the target sentence using a pre-trained topic vector model;
S73, determine the topic probability distribution of the text to be classified using a pre-trained first topic model;
S74, determine the criticality of each word from its word vector, its topic vector, and the topic probability distribution.
8. The text classification method based on word sense disambiguation and convolutional neural networks according to claim 7, characterized in that S74 comprises the following steps:
S741, determine a first similarity value between the word vector of the word and the topic vector of the word according to a preset similarity calculation method;
S742, determine the criticality of the word from the first similarity value and the topic probability distribution.
9. The text classification method based on word sense disambiguation and convolutional neural networks according to claim 1, characterized in that, in S8, the criticality of the word with the highest criticality in the target sentence is taken as the criticality of the target sentence.
CN201910565070.4A 2019-06-27 2019-06-27 A kind of file classification method based on word sense disambiguation convolutional neural networks Pending CN110321434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910565070.4A CN110321434A (en) 2019-06-27 2019-06-27 A kind of file classification method based on word sense disambiguation convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910565070.4A CN110321434A (en) 2019-06-27 2019-06-27 A kind of file classification method based on word sense disambiguation convolutional neural networks

Publications (1)

Publication Number Publication Date
CN110321434A true CN110321434A (en) 2019-10-11

Family

ID=68120528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910565070.4A Pending CN110321434A (en) 2019-06-27 2019-06-27 A kind of file classification method based on word sense disambiguation convolutional neural networks

Country Status (1)

Country Link
CN (1) CN110321434A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488623A (en) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 Multilingual text data sorting treatment method
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN107608968A (en) * 2017-09-22 2018-01-19 深圳市易图资讯股份有限公司 Chinese word cutting method, the device of text-oriented big data
CN108241741A (en) * 2017-12-29 2018-07-03 深圳市金立通信设备有限公司 A kind of file classification method, server and computer readable storage medium
US10108674B1 (en) * 2014-08-26 2018-10-23 Twitter, Inc. Method and system for topic disambiguation and classification
CN109408641A (en) * 2018-11-22 2019-03-01 山东工商学院 It is a kind of based on have supervision topic model file classification method and system
CN109726385A (en) * 2017-10-31 2019-05-07 株式会社Ntt都科摩 Word sense disambiguation method and equipment, meaning of a word extended method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765757A (en) * 2019-10-16 2020-02-07 腾讯云计算(北京)有限责任公司 Text recognition method, computer-readable storage medium, and computer device
CN111310475A (en) * 2020-02-04 2020-06-19 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111310475B (en) * 2020-02-04 2023-03-10 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
US11687724B2 (en) 2020-09-30 2023-06-27 International Business Machines Corporation Word sense disambiguation using a deep logico-neural network
CN113723101A (en) * 2021-09-09 2021-11-30 国网电子商务有限公司 Word sense disambiguation method and device applied to intention recognition
CN117473095A (en) * 2023-12-27 2024-01-30 合肥工业大学 Short text classification method and system based on theme enhancement word representation
CN117473095B (en) * 2023-12-27 2024-03-29 合肥工业大学 Short text classification method and system based on theme enhancement word representation

Similar Documents

Publication Publication Date Title
CN110321434A (en) A kind of file classification method based on word sense disambiguation convolutional neural networks
CN104699763B (en) The text similarity gauging system of multiple features fusion
CN110532554A (en) Chinese abstract generation method, system and storage medium
CN108563638B (en) Microblog emotion analysis method based on topic identification and integrated learning
CN107315734B (en) A kind of method and system to be standardized based on time window and semantic variant word
CN108984661A (en) Entity alignment schemes and device in a kind of knowledge mapping
CN110717041B (en) Case retrieval method and system
CN103744953A (en) Network hotspot mining method based on Chinese text emotion recognition
CN106570180A (en) Artificial intelligence based voice searching method and device
CN112036177A (en) Text semantic similarity information processing method and system based on multi-model fusion
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN113312922B (en) Improved chapter-level triple information extraction method
CN106649250A (en) Method and device for identifying emotional new words
CN108536781B (en) Social network emotion focus mining method and system
CN113722492A (en) Intention identification method and device
Najafi et al. Text-to-Text Transformer in Authorship Verification Via Stylistic and Semantical Analysis.
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN114997288A (en) Design resource association method
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN113361252A (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN111191029B (en) AC construction method based on supervised learning and text classification
CN112632259A (en) Automatic dialog intention recognition system based on linguistic rule generation
CN107562774A (en) Generation method, system and the answering method and system of rare foreign languages word incorporation model
CN109241521B (en) Scientific literature high-attention sentence extraction method based on citation relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191011