CN110321434A - A text classification method based on a word sense disambiguation convolutional neural network - Google Patents
- Publication number
- CN110321434A (application CN201910565070.4A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- criticality
- convolutional neural
- neural networks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
A text classification method based on a word sense disambiguation convolutional neural network, comprising the following steps: configuring an ambiguity dictionary in which word senses have been determined; obtaining the relevant files, extracting the text content from the files, and performing word segmentation on each sentence of the text; determining the part of speech of each word in a sentence; determining the target words to be disambiguated; determining the sense of each target word and disambiguating it; performing word segmentation and stop-word removal on the original sentences contained in the disambiguated text to obtain target sentences corresponding to the original sentences; determining the criticality of each word in a target sentence; determining the criticality of each target sentence; sorting the sentences by criticality to obtain the target text; and classifying the target text with a trained text classification model based on a convolutional neural network. The invention performs text classification based on a word sense disambiguation convolutional neural network, optimizes the text classification method, improves both the efficiency and the accuracy of text classification, and saves time and labor.
Description
Technical field
The present invention relates to the field of text classification technology, and in particular to a text classification method based on a word sense disambiguation convolutional neural network.
Background art
With the growth of online media and the steadily increasing number of Internet users, large volumes of text data are produced continuously. How to process this data and classify it correctly is a pressing problem. Text classification trains a classifier on existing data and applies that classifier to test documents to determine the category of each document. Correct text classification lets users find the information they need faster and browse documents more conveniently. Automatic text classification means training a text classifier on training texts that carry category labels, and then using the classifier to identify texts whose category is unknown.
Existing text classification methods fall mainly into the following groups. Rule-based methods formulate rules from statistics over a large number of text features together with domain knowledge, and classify by those rules; they require a great deal of time and relevant domain experts. Methods based on vector space representation first select and extract features, represent the text in a vector space, and then build a classifier; they ignore the semantic information of words, and the high dimensionality easily leads to the curse of dimensionality. Methods based on distributed word vectors first select and extract features, build the text representation with methods such as LDA or Word2Vec, and then build a classifier; they capture only one of global information and local information and ignore the other, so classification accuracy is low.
Current text classification methods are therefore complex, inefficient, and of limited accuracy.
Summary of the invention
(1) Object of the invention
To solve the technical problems described in the background art, the present invention proposes a text classification method based on a word sense disambiguation convolutional neural network. The method performs text classification based on a word sense disambiguation convolutional neural network, optimizes the text classification method, improves both the efficiency and the accuracy of text classification, and saves time and labor.
(2) Technical solution
To solve the above problems, the present invention proposes a text classification method based on a word sense disambiguation convolutional neural network, comprising the following steps:
S1, configuring an ambiguity dictionary in which word senses have been determined;
S2, obtaining the relevant files, extracting the text content from the files, and performing word segmentation on each sentence of the text;
S3, tagging each sentence of the text with parts of speech to determine the part of speech of each word in the sentence;
S4, determining the target words to be disambiguated based on the ambiguity dictionary;
S5, determining the sense of each target word and disambiguating it, based on syntactic analysis of the sentence and analysis of contextual information;
S6, performing word segmentation and stop-word removal on the original sentences contained in the disambiguated text to obtain the target sentence corresponding to each original sentence;
S7, determining the criticality of each word in the target sentence;
S8, determining the criticality of the target sentence from the criticality of the words it contains;
S9, sorting the sentences by criticality to obtain the target text;
S10, classifying the target text using a trained text classification model based on a convolutional neural network.
Preferably, in S2, the ways of obtaining files include web crawling, online download, and batch import.
Preferably, in S2 and S6, word segmentation is performed with the jieba tool.
Preferably, in S6, stop-word removal is performed with a stopwords tool.
Preferably, in S6, each target sentence contains at least one word.
Preferably, in S7, the criticality of a word indicates how strongly the word relates to the topic that the text to be classified is meant to express.
Preferably, S7 comprises the following steps:
S71, determining the word vector of each word in the target sentence using a pre-trained first word vector model;
S72, determining the topic vector of each word in the target sentence using a pre-trained topic vector model;
S73, determining the topic probability distribution of the text to be classified using a pre-trained first topic model;
S74, determining the criticality of the word from its word vector, its topic vector, and the topic probability distribution.
Preferably, S74 comprises the following steps:
S741, determining a first similarity value between the word vector of the word and the topic vector of the word according to a preset similarity calculation method;
S742, determining the criticality of the word from the first similarity value and the topic probability distribution.
Preferably, in S8, the criticality of the most critical word in the target sentence is taken as the criticality of the target sentence.
The above technical solution of the present invention has the following beneficial technical effects:
The invention performs text classification based on a word sense disambiguation convolutional neural network, optimizes the text classification method, improves both the efficiency and the accuracy of text classification, and saves time and labor.
Brief description of the drawings
Fig. 1 is a flow chart of the text classification method based on a word sense disambiguation convolutional neural network proposed by the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawing. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the invention.
As shown in Fig. 1, the text classification method based on a word sense disambiguation convolutional neural network proposed by the present invention comprises the following steps:
S1, configuring an ambiguity dictionary in which word senses have been determined;
S2, obtaining the relevant files, extracting the text content from the files, and performing word segmentation on each sentence of the text;
S3, tagging each sentence of the text with parts of speech to determine the part of speech of each word in the sentence;
S4, determining the target words to be disambiguated based on the ambiguity dictionary;
S5, determining the sense of each target word and disambiguating it, based on syntactic analysis of the sentence and analysis of contextual information;
S6, performing word segmentation and stop-word removal on the original sentences contained in the disambiguated text to obtain the target sentence corresponding to each original sentence;
S7, determining the criticality of each word in the target sentence;
S8, determining the criticality of the target sentence from the criticality of the words it contains;
S9, sorting the sentences by criticality to obtain the target text;
S10, classifying the target text using a trained text classification model based on a convolutional neural network.
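Steps S1, S4, and S5 can be sketched as follows. The patent does not specify the dictionary format or the disambiguation rule, so everything in this sketch is an assumption made for illustration: the layout of `AMBIGUITY_DICT`, the English example senses, and the simple context-overlap scoring (in the spirit of the Lesk algorithm), which stands in for the syntactic and contextual analysis of S5.

```python
# Hypothetical ambiguity dictionary (S1): word -> {sense label: cue words}.
AMBIGUITY_DICT = {
    "bank": {
        "bank/finance": {"money", "loan", "account", "deposit"},
        "bank/river": {"river", "water", "shore", "fishing"},
    },
}


def find_targets(tokens):
    """S4: return the positions of tokens listed in the ambiguity dictionary."""
    return [i for i, tok in enumerate(tokens) if tok in AMBIGUITY_DICT]


def disambiguate(tokens, window=3):
    """S5 (simplified, assumed rule): pick the sense whose cue words overlap
    most with the tokens inside a +/- `window` context around the target."""
    result = list(tokens)
    for i in find_targets(tokens):
        context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        senses = AMBIGUITY_DICT[tokens[i]]
        # Ties are resolved in favour of the sense listed first.
        best = max(senses, key=lambda s: len(senses[s] & context))
        result[i] = best
    return result


sentence = ["he", "opened", "an", "account", "at", "the", "bank"]
print(disambiguate(sentence))
```

Replacing the ambiguous token with a sense label is one possible way to hand the result to the later steps; the patent only states that the target word is disambiguated.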
In an optional embodiment, in S2, the ways of obtaining files include web crawling, online download, and batch import.
In an optional embodiment, in S2 and S6, word segmentation is performed with the jieba tool.
In an optional embodiment, in S6, stop-word removal is performed with a stopwords tool.
In an optional embodiment, in S6, each target sentence contains at least one word.
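The stop-word removal of S6, together with the condition that every target sentence keeps at least one word, can be sketched as below. The sentences are assumed to be segmented already (the patent names jieba for that step), and `STOP_WORDS` is a tiny hypothetical list, not the contents of any particular stopwords tool. Dropping sentences that become empty is only one way to satisfy the at-least-one-word condition; the patent does not say how it is enforced.

```python
# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to"}


def to_target_sentences(segmented_sentences):
    """Remove stop words from each segmented sentence (S6).

    Returns (original_index, target_sentence) pairs so that every target
    sentence can be traced back to its original sentence. Sentences left
    with no words are dropped, so each target sentence has at least one word.
    """
    targets = []
    for idx, tokens in enumerate(segmented_sentences):
        kept = [tok for tok in tokens if tok.lower() not in STOP_WORDS]
        if kept:
            targets.append((idx, kept))
    return targets


sentences = [
    ["The", "price", "of", "oil", "is", "rising"],
    ["and", "to", "the"],
    ["Markets", "react"],
]
print(to_target_sentences(sentences))
# [(0, ['price', 'oil', 'rising']), (2, ['Markets', 'react'])]
```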
In an optional embodiment, in S7, the criticality of a word indicates how strongly the word relates to the topic that the text to be classified is meant to express.
In an optional embodiment, S7 comprises the following steps:
S71, determining the word vector of each word in the target sentence using a pre-trained first word vector model;
S72, determining the topic vector of each word in the target sentence using a pre-trained topic vector model;
S73, determining the topic probability distribution of the text to be classified using a pre-trained first topic model;
S74, determining the criticality of the word from its word vector, its topic vector, and the topic probability distribution.
In an optional embodiment, S74 comprises the following steps:
S741, determining a first similarity value between the word vector of the word and the topic vector of the word according to a preset similarity calculation method;
S742, determining the criticality of the word from the first similarity value and the topic probability distribution.
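A minimal sketch of S741 and S742 follows. The patent names neither the similarity measure nor the rule that combines it with the topic distribution, so two assumptions are made here: cosine similarity serves as the "preset similarity calculation method", and the word is assumed to have one topic vector per topic, with the per-topic similarities weighted by the topic probabilities of the text. The vectors are made-up numbers, not the output of any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_criticality(word_vec, topic_vecs, topic_probs):
    """S741 + S742 under the stated assumptions.

    word_vec    -- word vector from the first word vector model (S71)
    topic_vecs  -- one topic vector for this word per topic (S72, assumed)
    topic_probs -- topic probability distribution of the text (S73)
    """
    if len(topic_vecs) != len(topic_probs):
        raise ValueError("need one topic vector per topic probability")
    # S741: first similarity value, computed once per topic.
    sims = [cosine(word_vec, t) for t in topic_vecs]
    # S742 (assumed rule): weight each similarity by the topic's probability.
    return sum(p * s for p, s in zip(topic_probs, sims))


word_vec = [1.0, 0.0]
topic_vecs = [[1.0, 0.0], [0.0, 1.0]]  # topic 0 is aligned with the word
topic_probs = [0.8, 0.2]               # the text is mostly about topic 0
print(word_criticality(word_vec, topic_vecs, topic_probs))
```

Under this weighting, a word scores highly only when it is close to a topic that the text itself is likely to be about, which matches the stated purpose of the criticality value.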
In an optional embodiment, in S8, the criticality of the most critical word in the target sentence is taken as the criticality of the target sentence.
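The maximum rule of S8 and the sorting of S9 can be sketched as follows. The word scores are invented for the example, and sorting from most to least critical is an assumption, since the patent does not state the direction of the ordering.

```python
def sentence_criticality(word_scores):
    """S8: the sentence's criticality is that of its most critical word."""
    return max(word_scores)


def build_target_text(target_sentences, criticality_of):
    """S9: sort sentences by criticality, highest first (direction assumed).

    target_sentences -- list of non-empty token lists
    criticality_of   -- dict mapping a word to its criticality (from S7);
                        words missing from the dict count as 0.0
    """
    scored = [
        (sentence_criticality([criticality_of.get(w, 0.0) for w in sent]), sent)
        for sent in target_sentences
    ]
    # list.sort is stable, so equally critical sentences keep their order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored]


scores = {"oil": 0.9, "price": 0.6, "weather": 0.1, "markets": 0.7}
sentences = [["weather", "today"], ["price", "oil"], ["markets", "react"]]
print(build_target_text(sentences, scores))
# [['price', 'oil'], ['markets', 'react'], ['weather', 'today']]
```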
In the present invention, an ambiguity dictionary in which word senses have been determined is configured first. The relevant files are then obtained, the text content is extracted from the files, and word segmentation is performed on each sentence of the text. Next, each sentence of the text is tagged with parts of speech to determine the part of speech of each word, and the target words to be disambiguated are determined based on the ambiguity dictionary. The sense of each target word is then determined, and the word is disambiguated, based on syntactic analysis of the sentence and analysis of contextual information. Word segmentation and stop-word removal are then performed on the original sentences contained in the disambiguated text to obtain the target sentence corresponding to each original sentence. The criticality of each word in the target sentence is determined next, and the criticality of the target sentence is determined from it: the criticality of the most critical word in the target sentence is the criticality of the target sentence. The sentences are then sorted by criticality to obtain the target text. Finally, the target text is classified using a trained text classification model based on a convolutional neural network.
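The patent does not describe the architecture of the convolutional classifier used in the final step. As a rough illustration only, the sketch below shows the forward pass of a common text CNN design (filters slid over windows of word vectors, ReLU, max-over-time pooling, a dense layer, and softmax) in plain Python. All weights are hand-set hypothetical values; training and the embedding lookup are outside the scope of the sketch.

```python
import math


def conv_max_pool(embedded, kernel, bias):
    """Slide one filter over the word sequence, apply ReLU, and max-pool.

    embedded -- list of word vectors (one per token of the target text)
    kernel   -- list of `width` weight vectors, same dimension as the words
    """
    width = len(kernel)
    best = 0.0  # ReLU output is never negative, so 0.0 is a safe floor
    for start in range(len(embedded) - width + 1):
        act = bias
        for offset in range(width):
            act += sum(w * x for w, x in zip(kernel[offset], embedded[start + offset]))
        best = max(best, act)  # ReLU is folded into the running maximum
    return best


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedded, filters, dense_w, dense_b):
    """Forward pass: convolution -> max-over-time pooling -> dense -> softmax."""
    features = [conv_max_pool(embedded, k, b) for k, b in filters]
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(dense_w, dense_b)
    ]
    return softmax(logits)


# Three tokens with 2-dimensional word vectors (hypothetical values).
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
# Two filters of width 2, each given as (kernel, bias).
filters = [
    ([[1.0, 0.0], [0.0, 1.0]], 0.0),
    ([[0.0, 1.0], [0.0, 1.0]], 0.0),
]
dense_w = [[1.0, 0.0], [0.0, 1.0]]  # one row of weights per class
dense_b = [0.0, 0.0]
probs = classify(embedded, filters, dense_w, dense_b)
print(probs.index(max(probs)))  # index of the predicted class
```

In practice such a model would be built and trained with a deep learning framework; the pure-Python version is meant only to make the data flow of the classification step visible.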
The criticality of each word in the target sentence is determined as follows: the word vector of each word in the target sentence is determined using a pre-trained first word vector model; the topic vector of each word in the target sentence is determined using a pre-trained topic vector model; the topic probability distribution of the text to be classified is determined using a pre-trained first topic model; a first similarity value between the word vector of the word and the topic vector of the word is determined according to a preset similarity calculation method; and the criticality of the word is determined from the first similarity value and the topic probability distribution.
The invention performs text classification based on a word sense disambiguation convolutional neural network, optimizes the text classification method, improves both the efficiency and the accuracy of text classification, and saves time and labor.
It should be understood that the above specific embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or within equivalents of such scope and boundaries.
Claims (9)
1. A text classification method based on a word sense disambiguation convolutional neural network, characterized by comprising the following steps:
S1, configuring an ambiguity dictionary in which word senses have been determined;
S2, obtaining the relevant files, extracting the text content from the files, and performing word segmentation on each sentence of the text;
S3, tagging each sentence of the text with parts of speech to determine the part of speech of each word in the sentence;
S4, determining the target words to be disambiguated based on the ambiguity dictionary;
S5, determining the sense of each target word and disambiguating it, based on syntactic analysis of the sentence and analysis of contextual information;
S6, performing word segmentation and stop-word removal on the original sentences contained in the disambiguated text to obtain the target sentence corresponding to each original sentence;
S7, determining the criticality of each word in the target sentence;
S8, determining the criticality of the target sentence from the criticality of the words it contains;
S9, sorting the sentences by criticality to obtain the target text;
S10, classifying the target text using a trained text classification model based on a convolutional neural network.
2. The text classification method based on a word sense disambiguation convolutional neural network according to claim 1, characterized in that, in S2, the ways of obtaining files include web crawling, online download, and batch import.
3. The text classification method based on a word sense disambiguation convolutional neural network according to claim 1, characterized in that, in S2 and S6, word segmentation is performed with the jieba tool.
4. The text classification method based on a word sense disambiguation convolutional neural network according to claim 1, characterized in that, in S6, stop-word removal is performed with a stopwords tool.
5. The text classification method based on a word sense disambiguation convolutional neural network according to claim 1, characterized in that, in S6, each target sentence contains at least one word.
6. The text classification method based on a word sense disambiguation convolutional neural network according to claim 1, characterized in that, in S7, the criticality of a word indicates how strongly the word relates to the topic that the text to be classified is meant to express.
7. The text classification method based on a word sense disambiguation convolutional neural network according to claim 1, characterized in that S7 comprises the following steps:
S71, determining the word vector of each word in the target sentence using a pre-trained first word vector model;
S72, determining the topic vector of each word in the target sentence using a pre-trained topic vector model;
S73, determining the topic probability distribution of the text to be classified using a pre-trained first topic model;
S74, determining the criticality of the word from its word vector, its topic vector, and the topic probability distribution.
8. The text classification method based on a word sense disambiguation convolutional neural network according to claim 7, characterized in that S74 comprises the following steps:
S741, determining a first similarity value between the word vector of the word and the topic vector of the word according to a preset similarity calculation method;
S742, determining the criticality of the word from the first similarity value and the topic probability distribution.
9. The text classification method based on a word sense disambiguation convolutional neural network according to claim 1, characterized in that, in S8, the criticality of the most critical word in the target sentence is taken as the criticality of the target sentence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910565070.4A CN110321434A (en) | 2019-06-27 | 2019-06-27 | A text classification method based on a word sense disambiguation convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910565070.4A CN110321434A (en) | 2019-06-27 | 2019-06-27 | A text classification method based on a word sense disambiguation convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110321434A true CN110321434A (en) | 2019-10-11 |
Family
ID=68120528
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910565070.4A Pending CN110321434A (en) | 2019-06-27 | 2019-06-27 | A text classification method based on a word sense disambiguation convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321434A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765757A (en) * | 2019-10-16 | 2020-02-07 | 腾讯云计算(北京)有限责任公司 | Text recognition method, computer-readable storage medium, and computer device |
CN111310475A (en) * | 2020-02-04 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Training method and device of word sense disambiguation model |
CN113723101A (en) * | 2021-09-09 | 2021-11-30 | 国网电子商务有限公司 | Word sense disambiguation method and device applied to intention recognition |
US11687724B2 (en) | 2020-09-30 | 2023-06-27 | International Business Machines Corporation | Word sense disambiguation using a deep logico-neural network |
CN117473095A (en) * | 2023-12-27 | 2024-01-30 | 合肥工业大学 | Short text classification method and system based on theme enhancement word representation |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103488623A (en) * | 2013-09-04 | 2014-01-01 | 中国科学院计算技术研究所 | Multilingual text data sorting treatment method |
CN105045913A (en) * | 2015-08-14 | 2015-11-11 | 北京工业大学 | Text classification method based on WordNet and latent semantic analysis |
CN107608968A (en) * | 2017-09-22 | 2018-01-19 | 深圳市易图资讯股份有限公司 | Chinese word cutting method, the device of text-oriented big data |
CN108241741A (en) * | 2017-12-29 | 2018-07-03 | 深圳市金立通信设备有限公司 | A kind of file classification method, server and computer readable storage medium |
US10108674B1 (en) * | 2014-08-26 | 2018-10-23 | Twitter, Inc. | Method and system for topic disambiguation and classification |
CN109408641A (en) * | 2018-11-22 | 2019-03-01 | 山东工商学院 | It is a kind of based on have supervision topic model file classification method and system |
CN109726385A (en) * | 2017-10-31 | 2019-05-07 | 株式会社Ntt都科摩 | Word sense disambiguation method and equipment, meaning of a word extended method and device |
- 2019-06-27: CN application CN201910565070.4A filed (patent CN110321434A), status Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103488623A (en) * | 2013-09-04 | 2014-01-01 | 中国科学院计算技术研究所 | Multilingual text data sorting treatment method |
US10108674B1 (en) * | 2014-08-26 | 2018-10-23 | Twitter, Inc. | Method and system for topic disambiguation and classification |
CN105045913A (en) * | 2015-08-14 | 2015-11-11 | 北京工业大学 | Text classification method based on WordNet and latent semantic analysis |
CN107608968A (en) * | 2017-09-22 | 2018-01-19 | 深圳市易图资讯股份有限公司 | Chinese word cutting method, the device of text-oriented big data |
CN109726385A (en) * | 2017-10-31 | 2019-05-07 | 株式会社Ntt都科摩 | Word sense disambiguation method and equipment, meaning of a word extended method and device |
CN108241741A (en) * | 2017-12-29 | 2018-07-03 | 深圳市金立通信设备有限公司 | A kind of file classification method, server and computer readable storage medium |
CN109408641A (en) * | 2018-11-22 | 2019-03-01 | 山东工商学院 | It is a kind of based on have supervision topic model file classification method and system |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765757A (en) * | 2019-10-16 | 2020-02-07 | 腾讯云计算(北京)有限责任公司 | Text recognition method, computer-readable storage medium, and computer device |
CN111310475A (en) * | 2020-02-04 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Training method and device of word sense disambiguation model |
CN111310475B (en) * | 2020-02-04 | 2023-03-10 | 支付宝(杭州)信息技术有限公司 | Training method and device of word sense disambiguation model |
US11687724B2 (en) | 2020-09-30 | 2023-06-27 | International Business Machines Corporation | Word sense disambiguation using a deep logico-neural network |
CN113723101A (en) * | 2021-09-09 | 2021-11-30 | 国网电子商务有限公司 | Word sense disambiguation method and device applied to intention recognition |
CN117473095A (en) * | 2023-12-27 | 2024-01-30 | 合肥工业大学 | Short text classification method and system based on theme enhancement word representation |
CN117473095B (en) * | 2023-12-27 | 2024-03-29 | 合肥工业大学 | Short text classification method and system based on theme enhancement word representation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321434A (en) | A text classification method based on a word sense disambiguation convolutional neural network | |
CN104699763B (en) | The text similarity gauging system of multiple features fusion | |
CN110532554A (en) | Chinese abstract generation method, system and storage medium | |
CN108563638B (en) | Microblog emotion analysis method based on topic identification and integrated learning | |
CN107315734B (en) | A kind of method and system to be standardized based on time window and semantic variant word | |
CN108984661A (en) | Entity alignment schemes and device in a kind of knowledge mapping | |
CN110717041B (en) | Case retrieval method and system | |
CN103744953A (en) | Network hotspot mining method based on Chinese text emotion recognition | |
CN106570180A (en) | Artificial intelligence based voice searching method and device | |
CN112036177A (en) | Text semantic similarity information processing method and system based on multi-model fusion | |
CN112434164B (en) | Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration | |
CN113312922B (en) | Improved chapter-level triple information extraction method | |
CN106649250A (en) | Method and device for identifying emotional new words | |
CN108536781B (en) | Social network emotion focus mining method and system | |
CN113722492A (en) | Intention identification method and device | |
Najafi et al. | Text-to-Text Transformer in Authorship Verification Via Stylistic and Semantical Analysis. | |
CN107526721A (en) | A kind of disambiguation method and device to electric business product review vocabulary | |
CN114997288A (en) | Design resource association method | |
CN107451116B (en) | Statistical analysis method for mobile application endogenous big data | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN113361252A (en) | Text depression tendency detection system based on multi-modal features and emotion dictionary | |
CN111191029B (en) | AC construction method based on supervised learning and text classification | |
CN112632259A (en) | Automatic dialog intention recognition system based on linguistic rule generation | |
CN107562774A (en) | Generation method, system and the answering method and system of rare foreign languages word incorporation model | |
CN109241521B (en) | Scientific literature high-attention sentence extraction method based on citation relation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191011 |