CN111199155A - Text classification method and device - Google Patents

Publication number: CN111199155A (application CN201811275675.1A; granted and published as CN111199155B)
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: phrase, target text, vector, model, classification
Inventors: 王超, 李修鹏, 田文宝, 赵欣莅, 赵东伟, 张志朋, 樊锐强, 刘庆标, 尹学正, 温连魁
Applicant and current assignee: Feihu Information Technology Tianjin Co Ltd
Legal status: Active, granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Classifications

    • G06F18/24 Pattern recognition; analysing; classification techniques
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/084 Learning methods; backpropagation, e.g. using gradient descent
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a text classification method based on a multi-dimensional convolutional neural network model. Each phrase in a target text is converted into a corresponding phrase semantic representation vector, where each phrase comprises a plurality of words semantically represented by word-embedding vectors. Each phrase semantic representation vector in the target text is then input into a multi-granularity long and short term memory model for processing, and the average of the output vectors of the hidden layers of that model is determined as the hierarchical semantic vector of the target text. Finally, the hierarchical semantic vector is input into a classification model, yielding a probability distribution of the target text over a preset set of types, and the type with the maximum probability value is taken as the type of the target text. By combining the local semantics between words with the global semantics between phrases, the invention enhances the understanding of natural language and thereby improves the accuracy of text classification.

Description

Text classification method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a text classification method and a text classification device.
Background
The internet currently holds a vast amount of text data, and in order for users to efficiently obtain the text they want to browse according to its type, that data must be classified accurately.
Existing text classification methods fall mainly into dictionary-based methods, traditional machine-learning methods, and deep-learning methods. Dictionary-based methods build a series of dictionaries and rules and use lexical and syntactic analysis of the text as the basis for classification; the result depends on how the rules and the sentence-splitting method are constructed, so the approach does not generalize across methods. Traditional machine-learning methods manually label the training text and then run a supervised learning process; the classification result depends on the choice of feature representation, so the approach does not generalize across data. In recent years, deep learning has been favored by many researchers for its efficiency, plasticity and universality, and applying it to natural language processing and related fields has produced remarkable results.
The existing text classification method based on deep learning has achieved remarkable results, but from the aspects of word feature extraction method, text semantic representation and the like, the following problems mainly exist:
(1) Most existing word-vector acquisition methods extract word features according to word frequency alone, losing the word-order and semantic information of the words, so the resulting representations cannot meet the needs of semantic analysis.
Frequency-based feature methods assume that a text can be treated as a mere set or combination of words, ignoring word order and syntax, with each word occurring independently of whether any other word occurs; equivalently, that when an author writes an article, the word chosen at any position is selected independently, uninfluenced by the preceding sentence. Although this simplifies natural language processing and makes modeling convenient, the purpose of text semantic analysis is to derive the category of the whole article from the properties of its words by computational means, and the order and semantic information of the words are important factors in that analysis, so the assumption is unreasonable.
(2) When the whole text is modeled, the structure of the article is not fully considered, and the relationship between the local and global semantics of the text is ignored.
How to model the logical relationships among the sentences of a document is an urgent problem in text analysis. A text has a "word / sentence / chapter" compositional structure, but most existing text analysis methods ignore this hierarchy and model directly with the word as the basic unit. Words describe the basic information of a language, but a single word lacks context: different combinations of the same words yield different semantics, so taking the word as the sole granularity of text analysis is clearly unreasonable.
Disclosure of Invention
In view of this, the present invention provides a text classification method and device, which combine local semantics between words and global semantics between phrases to enhance understanding of natural language and further improve accuracy of text classification.
In order to achieve the above purpose, the invention provides the following specific technical scheme:
a method of text classification, comprising:
splitting a target text into a plurality of phrases, and splitting each phrase into a plurality of words;
respectively converting each phrase in the target text into a corresponding phrase semantic representation vector based on a multi-dimensional convolutional neural network model, wherein each phrase comprises a plurality of words semantically represented by word embedding vectors;
inputting each phrase semantic expression vector in the target text into a multi-granularity long and short term memory model for processing, and determining the average value of the output vectors of each hidden layer in the multi-granularity long and short term memory model as the hierarchical semantic vector of the target text;
and inputting the hierarchical semantic vector of the target text into a classification model for classification processing to obtain probability distribution of the target text in a preset type set, and taking the type corresponding to the maximum probability value as the type of the target text.
Optionally, the converting each phrase in the target text into a corresponding phrase semantic representation vector based on the multidimensional convolutional neural network model includes:
respectively connecting a plurality of word embedded vectors corresponding to each phrase in a target text in series to obtain a serial vector of each phrase in the target text, wherein the quantity value of words in each phrase in the target text is the same as the width value of a convolution kernel window of a multi-dimensional convolution neural network model;
and respectively inputting the serial vectors of each phrase in the target text into the multi-dimensional convolutional neural network model for processing, performing average sampling on the output vectors of the convolutional layers to obtain input vectors of the pooling layers, performing average folding on the output vectors of the pooling layers, and generating phrase semantic representation vectors corresponding to each phrase in the target text.
Optionally, the inputting each phrase semantic representation vector in the target text into a multi-granularity long and short term memory model for processing includes:
inputting each phrase semantic representation vector in the target text into a multi-granularity long-short term memory model for processing to obtain an output vector of a first hidden layer;
and for each hidden layer except the first hidden layer, performing non-relevant forgetting operation and relevant updating operation on the input vector corresponding to the hidden layer and the output vector of the previous hidden layer to obtain the output vector of each hidden layer.
Optionally, before the step of converting each phrase in the target text into a corresponding phrase semantic representation vector based on the multidimensional convolutional neural network model, the method further includes:
constructing a training set, a test set and a word vector matrix;
initializing parameters of a multi-dimensional convolutional neural network model, a multi-granularity long and short term memory model and a classification model;
training a multi-dimensional convolutional neural network model, a multi-granularity long and short term memory model and a classification model according to the training set, and adjusting parameters of the multi-dimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model and the word vector matrix by using a back propagation algorithm;
and testing the trained multidimensional convolution neural network model, the multi-granularity long and short term memory model and the classification model by using the test set, and stopping training when the test result meets the accuracy requirement of text classification.
A text classification apparatus comprising:
the target text splitting unit is used for splitting the target text into a plurality of phrases and splitting each phrase into a plurality of words;
the phrase semantic representation unit is used for respectively converting each phrase in the target text into corresponding phrase semantic representation vectors based on the multidimensional convolutional neural network model, wherein each phrase comprises a plurality of words which are semantically represented by word embedding vectors;
the hierarchical semantic expression unit is used for inputting each phrase semantic expression vector in the target text into a multi-granularity long and short term memory model for processing, and determining the average value of the output vector of each hidden layer in the multi-granularity long and short term memory model as the hierarchical semantic vector of the target text;
and the classification processing unit is used for inputting the hierarchical semantic vector of the target text into a classification model for classification processing to obtain the probability distribution of the target text in a preset type set, and taking the type corresponding to the maximum probability value as the type of the target text.
Optionally, the phrase semantic representation unit is specifically configured to respectively concatenate a plurality of word embedding vectors corresponding to each phrase in the target text to obtain a concatenated vector of each phrase in the target text, where a number value of a word in each phrase in the target text is the same as a width value of a convolution kernel window of the multidimensional convolution neural network model; and respectively inputting the serial vectors of each phrase in the target text into the multi-dimensional convolutional neural network model for processing, performing average sampling on the output vectors of the convolutional layers to obtain input vectors of the pooling layers, performing average folding on the output vectors of the pooling layers, and generating phrase semantic representation vectors corresponding to each phrase in the target text.
Optionally, the hierarchical semantic representation unit is specifically configured to input each phrase semantic representation vector in the target text into a multi-granularity long and short term memory model for processing, so as to obtain an output vector of a first hidden layer; and for each hidden layer except the first hidden layer, performing non-relevant forgetting operation and relevant updating operation on the input vector corresponding to the hidden layer and the output vector of the previous hidden layer to obtain the output vector of each hidden layer.
Optionally, the apparatus further comprises:
the building unit is used for building a training set, a test set and a word vector matrix;
the initialization unit is used for initializing parameters of the multi-dimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model;
the model training unit is used for training the multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model according to the training set and adjusting the parameters of the multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model and the word vector matrix by using a back propagation algorithm;
and the model testing unit is used for testing the trained multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model by using the test set, and stopping training when the test result meets the accuracy requirement of text classification.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses a text classification method, which is characterized in that semantic representation is carried out on words in each phrase in a target text by utilizing word embedding vectors, and each phrase in the target text is converted into corresponding phrase semantic representation vectors respectively based on a multi-dimensional convolutional neural network model so as to represent the local semantic relationship between words in the phrases. And processing all phrase semantic expression vectors in the target text by using the multi-granularity long and short term memory model, determining the average value of the output vectors of each hidden layer in the multi-granularity long and short term memory model as the layered semantic vector of the target text, and realizing semantic expression between phrases at different intervals in the target text. And the hierarchical semantic vectors obtained by combining the local semantics among the words and the global semantics among the phrases are used as the feature vectors for classification to perform classification processing, so that the understanding of natural language is enhanced, and the accuracy of text classification is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a diagram illustrating text classification performed by a hierarchical neural network model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a text classification method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multidimensional convolutional neural network model disclosed in an embodiment of the present invention;
FIG. 4-a is a schematic diagram of a classical LSTM model disclosed in an embodiment of the present invention;
FIG. 4-b is a schematic diagram of a multi-granularity long-short term memory model according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a model construction method disclosed in the embodiments of the present invention;
fig. 6 is a schematic structural diagram of a text classification device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The existing deep-learning text classification methods suffer from the problems described above: most word-vector acquisition methods extract word features according to word frequency, losing word-order and syntactic semantic information, so the results cannot meet the needs of semantic analysis; and when the whole text is modeled, the structure of the article is not fully considered and the relationship between the local and global semantics of the text is ignored. The invention therefore provides a text classification method that splits a text hierarchically (word, phrase, text) and classifies the resulting components using a hierarchical neural network model comprising a multidimensional convolutional neural network model, a multi-granularity long and short term memory model and a classification model. Referring to fig. 1, fig. 1 is a schematic diagram of text classification by the hierarchical neural network model: operation 1 is performed by the multidimensional convolutional neural network model and converts a phrase into a phrase semantic representation vector; operation 2 is performed by the multi-granularity long and short term memory model and converts the phrase semantic representation vectors into the hierarchical semantic vector of the text.
Specifically, please refer to fig. 2, this embodiment discloses a text classification method, which specifically includes the following steps:
S101: splitting a target text into a plurality of phrases, and splitting each phrase into a plurality of words;
specifically, the method for splitting the target text into a plurality of phrases may be any one of the existing text splitting methods based on semantics, and the method for splitting the phrases into a plurality of words may be any one of the existing word splitting methods.
S102: respectively converting each phrase in the target text into corresponding phrase semantic expression vectors based on a multi-dimensional convolutional neural network model;
wherein each phrase comprises a plurality of words semantically represented by word-embedding vectors. Each word is represented as a low-dimensional, continuous, real-valued vector, and all vectors are stored in a word-vector matrix L ∈ R^(dim×|V|), where dim is the word-embedding dimension and V is the dictionary.
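A minimal sketch of the embedding lookup, under the assumption that out-of-vocabulary words map to a zero vector (the patent does not specify OOV handling); `embed_phrase`, `L` and `vocab` are illustrative names:

```python
def embed_phrase(words, L, vocab):
    # L is the word-vector matrix, dim rows x |V| columns; vocab maps a
    # word to its column index. Each word becomes its dim-length column.
    dim = len(L)
    zero = [0.0] * dim
    return [[L[r][vocab[w]] for r in range(dim)] if w in vocab else zero
            for w in words]
```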
A convolutional neural network (CNN) is currently a relatively advanced semantic model for natural language processing: it can learn a fixed-length vector from a variable-length phrase, and its computation depends mainly on the order of words within the phrase rather than on a syntax tree. The multidimensional convolutional neural network model is an improved CNN whose input is the concatenated vector of each phrase in the target text, obtained by concatenating the word-embedding vectors of the words in that phrase.
Referring to fig. 3, fig. 3 is a schematic diagram of the multidimensional convolutional neural network model. In the first lookup layer, each word in each phrase of the target text is semantically represented by its word-embedding vector; the second-layer convolutional layer convolves the word vectors with kernels of different window sizes; the convolutional-layer outputs are then average-sampled to form the pooling-layer input, and the pooling-layer outputs are average-folded to generate the phrase semantic representation vector corresponding to each phrase in the target text.
Convolution kernels with different window sizes capture context semantics at different granularities, and the semantic representation of the phrase is generated from them. This approach has proven effective in sentiment analysis and annotation. In this embodiment, kernels with window widths 3, 4 and 5 are used to capture the semantics of 3-grams, 4-grams and 5-grams.
Given a phrase containing n words w_1, w_2, w_3, …, w_n, let l_cf be the window width of a convolution kernel cf, and let W_cf and b_cf be the shared parameters of the linear part of cf. Each word w_i in the phrase is mapped through the word-vector matrix L ∈ R^(dim×|V|) to its word-embedding vector we_i ∈ R^dim, where dim is the embedding dimension and |V| is the size of the lexicon. The input to the convolutional layer is the concatenation of the embeddings of the l_cf words in the kernel window, i.e. I_cf = [we_i; we_{i+1}; …; we_{i+l_cf-1}] ∈ R^(dim·l_cf). The output of the convolutional layer is thus expressed as:

O_cf = tanh(W_cf · I_cf + b_cf),

where W_cf ∈ R^(lo_cf × dim·l_cf) and b_cf ∈ R^(lo_cf), lo_cf is the length of the convolutional-layer output, and tanh adds a non-linearity to the convolution operation.
After that, the convolutional-layer outputs are averaged to obtain global semantics, and the outputs of all convolution kernels are then merged through one average folding layer to obtain the final phrase semantic representation.
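The convolution, average sampling, and average folding steps can be sketched in plain Python. This is a minimal sketch: randomly initialized weights stand in for the learned parameters W_cf and b_cf, and the output length is illustrative:

```python
import math
import random

def conv_phrase_vector(word_vecs, window, out_len, rng):
    # One convolution kernel cf: slide a window of `window` words,
    # concatenate their embeddings into I_cf, apply tanh(W_cf . I_cf + b_cf),
    # then average-sample the outputs over all window positions.
    dim = len(word_vecs[0])
    in_len = dim * window
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_len)] for _ in range(out_len)]
    b = [0.0] * out_len
    outs = []
    for i in range(len(word_vecs) - window + 1):
        I = [x for wv in word_vecs[i:i + window] for x in wv]  # concatenation
        outs.append([math.tanh(sum(W[r][j] * I[j] for j in range(in_len)) + b[r])
                     for r in range(out_len)])
    return [sum(o[r] for o in outs) / len(outs) for r in range(out_len)]

def phrase_semantic_vector(word_vecs, windows=(3, 4, 5), out_len=8, seed=0):
    # Average folding: merge the vectors produced by the kernels of
    # widths 3, 4 and 5 into one phrase semantic representation vector.
    rng = random.Random(seed)
    vecs = [conv_phrase_vector(word_vecs, w, out_len, rng)
            for w in windows if len(word_vecs) >= w]
    return [sum(v[r] for v in vecs) / len(vecs) for r in range(out_len)]
```

In the trained model, W_cf and b_cf would be adjusted by backpropagation rather than drawn at random.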
S103: inputting each phrase semantic expression vector in the target text into a multi-granularity long and short term memory model for processing, and determining the average value of the output vectors of each hidden layer in the multi-granularity long and short term memory model as the hierarchical semantic vector of the target text;
referring to FIG. 4-a, the output of the last hidden layer in the classical LSTM model is usually used as the final text representation. Referring to fig. 4-b, the multi-granularity Long-Short Term Memory model is an improvement of the classic LSTM (Long Short-Term Memory) model, and averages the outputs of all hidden layers to obtain the final text semantic representation. In this way, we can consider the semantic and emotional logical relations between different intervals of phrases.
Specifically, the transfer function of the multi-granularity long and short term memory model in this embodiment is as follows:

f_t = δ(W_f · [h_{t-1}; x_t] + b_f)
i_t = δ(W_i · [h_{t-1}; x_t] + b_i)
C′_t = tanh(W_C · [h_{t-1}; x_t] + b_C)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C′_t
h_t = δ(W_h · [h_{t-1}; x_t] + b_h) ⊙ tanh(C_t)

where:
x_t is the input vector of the LSTM at step t, i.e. the phrase semantic representation vector of the t-th phrase in the target text;
f_t, i_t, W_f, W_i, b_f and b_i are responsible for the forgetting and updating operations on the hidden vector and the input vector;
W_C and b_C generate the candidate vector C′_t;
h_{t-1} is the hidden vector of the LSTM, representing historical information and storing the knowledge accumulated over the previous t-1 steps;
C_{t-1} and C′_t are, at step t, the previous state vector and the candidate vector of the neuron;
W_h and b_h are used to update the hidden vector from the previous hidden vector, the input vector and the neuron state vector, producing the new hidden vector;
⊙ denotes the element-wise multiplication of two vectors;
h_t is the output vector of the hidden layer at step t.
The average of the hidden-layer output vectors over all steps (i.e. of h_1, h_2, …, h_n) is determined as the hierarchical semantic vector of the target text.
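A scalar toy version of the transfer function and the hidden-output averaging can be sketched as follows. The sigmoid gates, parameter names and dictionary layout are assumptions for illustration; this is a sketch of the idea, not the trained model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    # One step of the transfer function: forget gate f, input gate i,
    # candidate C', new state C, and the gated output h (scalar toy case).
    f = sigmoid(p["Wf"][0] * h_prev + p["Wf"][1] * x + p["bf"])
    i = sigmoid(p["Wi"][0] * h_prev + p["Wi"][1] * x + p["bi"])
    Cc = math.tanh(p["WC"][0] * h_prev + p["WC"][1] * x + p["bC"])
    C = f * C_prev + i * Cc
    o = sigmoid(p["Wh"][0] * h_prev + p["Wh"][1] * x + p["bh"])
    return o * math.tanh(C), C

def hierarchical_semantic(xs, p):
    # The multi-granularity twist: average ALL hidden outputs h_1..h_n
    # instead of keeping only the last one, as the classic LSTM does.
    h, C, hs = 0.0, 0.0, []
    for x in xs:
        h, C = lstm_step(x, h, C, p)
        hs.append(h)
    return sum(hs) / len(hs)
```

In the real model, x, h and C are vectors and each phrase semantic representation vector is one x_t.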
S104: and inputting the hierarchical semantic vector of the target text into a classification model for classification processing to obtain probability distribution of the target text in a preset type set, and taking the type corresponding to the maximum probability value as the type of the target text.
In particular, the classification model may be a softmax classifier.
For example, the conditional probability that the i-th text c^(i) should be assigned to class e_k in the preset type set (k = 1, 2, …, |E|) can be calculated by the softmax function:

p(e_k | c^(i)) = exp(ω_k · x^(i)) / Σ_{j=1}^{|E|} exp(ω_j · x^(i)),

where c^(i) is the i-th text, x^(i) is the hierarchical semantic representation vector of c^(i), E is the preset type set, Ω is the transition matrix from the hierarchical semantic representation vector x^(i) to a real-valued vector over E, and ω_j can be regarded as the combination coefficient of x^(i) for type e_j.
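A minimal sketch of the classification step, assuming a plain softmax over the dot products ω_j · x^(i); the max-subtraction is a standard numerical-stability trick not mentioned in the patent, and the function and argument names are illustrative:

```python
import math

def softmax_classify(x, Omega, types):
    # p(e_k | c) = exp(omega_k . x) / sum_j exp(omega_j . x); the type
    # with the maximum probability is returned as the predicted type.
    scores = [sum(w * v for w, v in zip(row, x)) for row in Omega]
    m = max(scores)                    # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    probs = [e / Z for e in exps]
    return types[probs.index(max(probs))], probs
```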
In the text classification method disclosed in this embodiment, semantic representation is performed on words in each phrase in a target text by using word embedding vectors, and each phrase in the target text is converted into a corresponding phrase semantic representation vector based on a multidimensional convolutional neural network model, so as to represent a local semantic relationship between words in the phrases. And processing all phrase semantic expression vectors in the target text by using the multi-granularity long and short term memory model, determining the average value of the output vectors of each hidden layer in the multi-granularity long and short term memory model as the layered semantic vector of the target text, and realizing semantic expression between phrases at different intervals in the target text. And the hierarchical semantic vectors obtained by combining the local semantics among the words and the global semantics among the phrases are used as the feature vectors for classification to perform classification processing, so that the understanding of natural language is enhanced, and the accuracy of text classification is further improved.
It should be noted that before text classification, a multi-dimensional convolutional neural network model, a multi-granularity long-short term memory model, and a classification model need to be constructed, please refer to fig. 5, where fig. 5 is a schematic flow chart of the model construction method, and specifically includes the following steps:
S401: constructing a training set, a test set and a word vector matrix;
S402: initializing parameters of a multi-dimensional convolutional neural network model, a multi-granularity long and short term memory model and a classification model;
s403: training a multi-dimensional convolutional neural network model, a multi-granularity long and short term memory model and a classification model according to the training set, and adjusting parameters of the multi-dimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model and the word vector matrix by using a back propagation algorithm;
The text data in the training set, i.e. data obtained by splitting the texts into phrases and words, are input into the multi-dimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model for processing. Specifically, each phrase in a training text is converted into a corresponding phrase semantic representation vector based on the multi-dimensional convolutional neural network model, where each phrase comprises a plurality of words semantically represented by word-embedding vectors; each phrase semantic representation vector in the training text is input into the multi-granularity long and short term memory model for processing, and the average of the hidden-layer output vectors of that model is determined as the hierarchical semantic vector of the training text; and the hierarchical semantic vector of the training text is input into the classification model for classification processing, yielding the probability distribution of the training text over the preset type set, with the type corresponding to the maximum probability value taken as the type of the training text.
S404: and testing the trained multidimensional convolution neural network model, the multi-granularity long and short term memory model and the classification model by using the test set, and stopping training when the test result meets the accuracy requirement of text classification.
Specifically, a threshold of accuracy of the text classification may be preset, and the training may be stopped when the accuracy of the test result reaches the threshold of accuracy of the preset text classification.
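The stopping rule just described reduces to a one-line check; the 0.95 threshold below is a hypothetical example, not a value taken from this text:

```python
# Minimal sketch of the S404 stopping rule: stop training once test-set
# accuracy reaches a preset threshold. The default threshold is illustrative.
def should_stop_training(test_accuracy: float, accuracy_threshold: float = 0.95) -> bool:
    return test_accuracy >= accuracy_threshold
```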
It should be noted that the text classification method disclosed in this embodiment may be applied to text classification in any field. To further explain the model construction method and the text classification method disclosed in this embodiment, a scenario embodiment is described below.
At present, internet video platforms carry a large number of borderline videos and clickbait. The content of such a video does not involve sensitive material such as pornography, but its title and cover image are vulgar and misleading, so that the title does not match the content. This harms the user experience and misleads users into clicking and watching, which in turn degrades video understanding and recommendation; a method for detecting borderline video titles is therefore needed. When the text classification method is applied to borderline-title detection, the model construction method is as follows:
Input: title training set C_train, title test set C_test, and word vector matrix L ∈ R^(dim×|V|)
Output: borderline class labels for the test set C_test
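Putting steps S401 to S404 together, a minimal end-to-end forward pass can be sketched as follows. All sizes, the mean pooling that stands in for the multi-dimensional convolutional neural network, and the plain tanh recurrence that stands in for the multi-granularity long and short term memory model are illustrative assumptions, not the patent's definitions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; none of these values come from the patent text.
dim, n_words, n_phrases, n_classes = 8, 3, 4, 2

# Stage 1: phrase semantics. Each phrase's word-embedding matrix is reduced
# to one phrase semantic representation vector (the patent uses a CNN; a
# mean is used here only to keep the sketch short).
phrases = rng.normal(size=(n_phrases, n_words, dim))
phrase_vecs = phrases.mean(axis=1)                     # (n_phrases, dim)

# Stage 2: hierarchical semantics. Run a recurrent pass over the phrase
# vectors and average the hidden states of every step, standing in for the
# multi-granularity long and short term memory model.
W = rng.normal(scale=0.1, size=(dim, dim))
h = np.zeros(dim)
hidden_states = []
for v in phrase_vecs:
    h = np.tanh(v + W @ h)
    hidden_states.append(h)
doc_vec = np.mean(hidden_states, axis=0)               # hierarchical semantic vector

# Stage 3: classification. Softmax over the preset type set; the type with
# the maximum probability is taken as the type of the text.
Wc = rng.normal(scale=0.1, size=(n_classes, dim))
logits = Wc @ doc_vec
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_type = int(probs.argmax())
```

In a real build, the CNN and recurrent stand-ins would be trained jointly with back-propagation against the training set, as step S403 describes.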
Referring to fig. 6, the present embodiment correspondingly discloses a text classification device, which includes:
a target text splitting unit 501, configured to split a target text into multiple phrases, and split each phrase into multiple words;
a phrase semantic representation unit 502, configured to convert each phrase in the target text into a corresponding phrase semantic representation vector respectively based on a multidimensional convolutional neural network model, where each phrase includes a plurality of words semantically represented by word-embedded vectors;
optionally, the phrase semantic representation unit 502 is specifically configured to concatenate, in series, the plurality of word embedding vectors corresponding to each phrase in the target text, to obtain a concatenated vector of each phrase in the target text, where the number of words in each phrase in the target text equals the width of the convolution kernel window of the multi-dimensional convolutional neural network model; and to input the concatenated vector of each phrase in the target text into the multi-dimensional convolutional neural network model for processing, perform average sampling on the output vectors of the convolutional layers to obtain the input vectors of the pooling layers, and perform average folding on the output vectors of the pooling layers to generate the phrase semantic representation vector corresponding to each phrase in the target text.
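A minimal sketch of the phrase-level processing described for unit 502, assuming 3-word phrases, 4-dimensional embeddings, and 6 filters (all hypothetical sizes). The word embeddings are concatenated in series, and each convolution filter spans the full window because the window width equals the phrase length; "average folding" is modelled as averaging adjacent pairs of filter outputs, which is one plausible reading since this excerpt does not define the operation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 3 words per phrase, embedding dimension 4, 6 filters.
words_per_phrase, dim, n_filters = 3, 4, 6

# Concatenate the word embedding vectors of one phrase in series.
word_embeddings = rng.normal(size=(words_per_phrase, dim))
concat_vec = word_embeddings.reshape(-1)               # length words_per_phrase * dim

# Because the convolution window width equals the number of words in the
# phrase, each filter covers the whole concatenated vector and yields a
# single activation.
kernels = rng.normal(size=(n_filters, words_per_phrase * dim))
conv_out = np.tanh(kernels @ concat_vec)               # (n_filters,)

# "Average folding" modelled as averaging adjacent pairs of filter outputs
# (an assumption, not the patent's definition).
phrase_vec = conv_out.reshape(-1, 2).mean(axis=1)      # (n_filters // 2,)
```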
A hierarchical semantic representation unit 503, configured to input each phrase semantic representation vector in the target text into a multi-granularity long and short term memory model for processing, and determine the average value of the output vectors of the hidden layers in the multi-granularity long and short term memory model as the hierarchical semantic vector of the target text;
optionally, the hierarchical semantic representation unit 503 is specifically configured to input each phrase semantic representation vector in the target text into the multi-granularity long and short term memory model for processing to obtain the output vector of the first hidden layer; and, for each hidden layer other than the first hidden layer, to perform a non-relevant forgetting operation and a relevant updating operation on the input vector corresponding to that hidden layer and the output vector of the previous hidden layer, so as to obtain the output vector of each hidden layer.
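The "non-relevant forgetting operation" and "relevant updating operation" are not defined in this excerpt; the sketch below substitutes standard sigmoid gates — a forget gate on the previous hidden output and an update gate on a tanh candidate — purely as stand-ins, with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4  # hypothetical hidden size

def gate(W, x, U, h):
    # Sigmoid gate over the current input and the previous hidden output.
    return 1.0 / (1.0 + np.exp(-(W @ x + U @ h)))

# Hypothetical parameters for the forgetting and updating operations.
Wf, Uf = rng.normal(scale=0.1, size=(2, dim, dim))
Wu, Uu = rng.normal(scale=0.1, size=(2, dim, dim))
Wc, Uc = rng.normal(scale=0.1, size=(2, dim, dim))

def hidden_layer(x, h_prev):
    """One hidden layer: keep part of the previous output (a stand-in for
    "non-relevant forgetting"), then mix in a candidate built from the
    current input (a stand-in for "relevant updating")."""
    f = gate(Wf, x, Uf, h_prev)            # how much of h_prev to keep
    u = gate(Wu, x, Uu, h_prev)            # how much new content to add
    cand = np.tanh(Wc @ x + Uc @ h_prev)
    return f * h_prev + u * cand

phrase_vecs = rng.normal(size=(5, dim))
h = np.tanh(phrase_vecs[0])                # output vector of the first hidden layer
outputs = [h]
for x in phrase_vecs[1:]:
    h = hidden_layer(x, h)
    outputs.append(h)

# Hierarchical semantic vector: mean of all hidden-layer output vectors.
hierarchical_vec = np.mean(outputs, axis=0)
```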
And a classification processing unit 504, configured to input the hierarchical semantic vector of the target text into a classification model for classification processing, to obtain probability distribution of the target text in a preset type set, and use a type corresponding to the maximum probability value as the type of the target text.
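The classification step of unit 504 — a probability distribution over the preset type set, with the type of maximum probability taken as the result — can be sketched as follows; the type names and weight shapes are illustrative assumptions:

```python
import numpy as np

def classify(hierarchical_vec, W, b, type_set):
    """Softmax over the preset type set; return the probability distribution
    and the type corresponding to the maximum probability value."""
    logits = W @ hierarchical_vec + b
    e = np.exp(logits - logits.max())      # numerically stable softmax
    probs = e / e.sum()
    return probs, type_set[int(probs.argmax())]

# Hypothetical three-type example; the type names are illustrative only.
type_set = ["normal", "borderline", "other"]
rng = np.random.default_rng(3)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
probs, predicted = classify(rng.normal(size=4), W, b, type_set)
```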
Optionally, the apparatus further comprises:
the building unit is used for building a training set, a test set and a word vector matrix;
the initialization unit is used for initializing parameters of the multi-dimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model;
the model training unit is used for training the multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model according to the training set and adjusting the parameters of the multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model and the word vector matrix by using a back propagation algorithm;
and the model testing unit is used for testing the trained multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model by using the test set, and stopping training when the test result meets the accuracy requirement of text classification.
In the text classification device disclosed in this embodiment, word embedding vectors are used to semantically represent the words in each phrase of a target text, and each phrase in the target text is converted into a corresponding phrase semantic representation vector based on a multi-dimensional convolutional neural network model, representing the local semantic relationship between the words within a phrase. The phrase semantic representation vectors of the target text are then processed by the multi-granularity long and short term memory model, and the average value of the output vectors of its hidden layers is determined as the hierarchical semantic vector of the target text, realizing semantic representation between phrases at different distances in the target text. This hierarchical semantic vector, which combines the local semantics among words with the global semantics among phrases, is used as the feature vector for classification, which strengthens the understanding of the natural language and thereby improves the accuracy of text classification.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method of text classification, comprising:
splitting a target text into a plurality of phrases, and splitting each phrase into a plurality of words;
respectively converting each phrase in the target text into a corresponding phrase semantic representation vector based on a multi-dimensional convolutional neural network model, wherein each phrase comprises a plurality of words semantically represented by word embedding vectors;
inputting each phrase semantic expression vector in the target text into a multi-granularity long and short term memory model for processing, and determining the average value of the output vectors of each hidden layer in the multi-granularity long and short term memory model as the hierarchical semantic vector of the target text;
and inputting the hierarchical semantic vector of the target text into a classification model for classification processing to obtain probability distribution of the target text in a preset type set, and taking the type corresponding to the maximum probability value as the type of the target text.
2. The method according to claim 1, wherein the converting each phrase in the target text into a corresponding phrase semantic representation vector based on the multidimensional convolutional neural network model comprises:
respectively connecting a plurality of word embedded vectors corresponding to each phrase in a target text in series to obtain a serial vector of each phrase in the target text, wherein the quantity value of words in each phrase in the target text is the same as the width value of a convolution kernel window of a multi-dimensional convolution neural network model;
and respectively inputting the serial vectors of each phrase in the target text into the multi-dimensional convolutional neural network model for processing, performing average sampling on the output vectors of the convolutional layers to obtain input vectors of the pooling layers, performing average folding on the output vectors of the pooling layers, and generating phrase semantic representation vectors corresponding to each phrase in the target text.
3. The method according to claim 1, wherein the inputting each phrase semantic representation vector in the target text into a multi-granularity long-short term memory model for processing comprises:
inputting each phrase semantic representation vector in the target text into a multi-granularity long-short term memory model for processing to obtain an output vector of a first hidden layer;
and for each hidden layer except the first hidden layer, performing non-relevant forgetting operation and relevant updating operation on the input vector corresponding to the hidden layer and the output vector of the previous hidden layer to obtain the output vector of each hidden layer.
4. The method of claim 1, wherein before the step of converting each phrase in the target text into the corresponding phrase semantic representation vector based on the multidimensional convolutional neural network model, the method further comprises:
constructing a training set, a test set and a word vector matrix;
initializing parameters of a multi-dimensional convolutional neural network model, a multi-granularity long and short term memory model and a classification model;
training a multi-dimensional convolutional neural network model, a multi-granularity long and short term memory model and a classification model according to the training set, and adjusting parameters of the multi-dimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model and the word vector matrix by using a back propagation algorithm;
and testing the trained multidimensional convolution neural network model, the multi-granularity long and short term memory model and the classification model by using the test set, and stopping training when the test result meets the accuracy requirement of text classification.
5. A text classification apparatus, comprising:
the target text splitting unit is used for splitting the target text into a plurality of phrases and splitting each phrase into a plurality of words;
the phrase semantic representation unit is used for respectively converting each phrase in the target text into corresponding phrase semantic representation vectors based on the multidimensional convolutional neural network model, wherein each phrase comprises a plurality of words which are semantically represented by word embedding vectors;
the hierarchical semantic expression unit is used for inputting each phrase semantic expression vector in the target text into a multi-granularity long and short term memory model for processing, and determining the average value of the output vector of each hidden layer in the multi-granularity long and short term memory model as the hierarchical semantic vector of the target text;
and the classification processing unit is used for inputting the hierarchical semantic vector of the target text into a classification model for classification processing to obtain the probability distribution of the target text in a preset type set, and taking the type corresponding to the maximum probability value as the type of the target text.
6. The apparatus according to claim 5, wherein the phrase semantic representation unit is specifically configured to concatenate a plurality of word embedding vectors corresponding to each phrase in the target text, respectively, to obtain a concatenated vector of each phrase in the target text, where a number value of words in each phrase in the target text is the same as a window width value of a convolution kernel of the multidimensional convolution neural network model; and respectively inputting the serial vectors of each phrase in the target text into the multi-dimensional convolutional neural network model for processing, performing average sampling on the output vectors of the convolutional layers to obtain input vectors of the pooling layers, performing average folding on the output vectors of the pooling layers, and generating phrase semantic representation vectors corresponding to each phrase in the target text.
7. The apparatus according to claim 5, wherein the hierarchical semantic representation unit is specifically configured to input each phrase semantic representation vector in the target text into a multi-granularity long-short term memory model for processing, so as to obtain an output vector of a first hidden layer; and for each hidden layer except the first hidden layer, performing non-relevant forgetting operation and relevant updating operation on the input vector corresponding to the hidden layer and the output vector of the previous hidden layer to obtain the output vector of each hidden layer.
8. The apparatus of claim 5, further comprising:
the building unit is used for building a training set, a test set and a word vector matrix;
the initialization unit is used for initializing parameters of the multi-dimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model;
the model training unit is used for training the multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model according to the training set and adjusting the parameters of the multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model and the word vector matrix by using a back propagation algorithm;
and the model testing unit is used for testing the trained multidimensional convolutional neural network model, the multi-granularity long and short term memory model and the classification model by using the test set, and stopping training when the test result meets the accuracy requirement of text classification.
CN201811275675.1A 2018-10-30 2018-10-30 Text classification method and device Active CN111199155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811275675.1A CN111199155B (en) 2018-10-30 2018-10-30 Text classification method and device


Publications (2)

Publication Number Publication Date
CN111199155A true CN111199155A (en) 2020-05-26
CN111199155B CN111199155B (en) 2023-09-15

Family

ID=70745809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811275675.1A Active CN111199155B (en) 2018-10-30 2018-10-30 Text classification method and device

Country Status (1)

Country Link
CN (1) CN111199155B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052622A (en) * 2020-08-11 2020-12-08 国网河北省电力有限公司 Defect disposal method for deep multi-view semantic document representation under cloud platform
CN112215000A (en) * 2020-10-21 2021-01-12 重庆邮电大学 Text classification method based on entity replacement
CN112966103A (en) * 2021-02-05 2021-06-15 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834747A (en) * 2015-05-25 2015-08-12 Institute of Automation, Chinese Academy of Sciences Short text classification method based on convolutional neural network
CN106601228A (en) * 2016-12-09 2017-04-26 Baidu Online Network Technology (Beijing) Co., Ltd. Sample marking method and device based on artificial intelligence prosody prediction
CN108595632A (en) * 2018-04-24 2018-09-28 Fuzhou University Hybrid neural network text classification method fusing abstract and body features
CA3000166A1 (en) * 2017-04-03 2018-10-03 Royal Bank Of Canada Systems and methods for cyberbot network detection


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Jingxue: "Character-level convolutional neural network short text classification algorithm" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052622A (en) * 2020-08-11 2020-12-08 国网河北省电力有限公司 Defect disposal method for deep multi-view semantic document representation under cloud platform
CN112215000A (en) * 2020-10-21 2021-01-12 重庆邮电大学 Text classification method based on entity replacement
CN112215000B (en) * 2020-10-21 2022-08-23 重庆邮电大学 Text classification method based on entity replacement
CN112966103A (en) * 2021-02-05 2021-06-15 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning
CN112966103B (en) * 2021-02-05 2022-04-19 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning

Also Published As

Publication number Publication date
CN111199155B (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN106980683B (en) Blog text abstract generating method based on deep learning
CN110059188B (en) Chinese emotion analysis method based on bidirectional time convolution network
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN110704576B (en) Text-based entity relationship extraction method and device
CN104965822B (en) A kind of Chinese text sentiment analysis method based on Computerized Information Processing Tech
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN108062388A (en) Interactive reply generation method and device
CN106569998A (en) Text named entity recognition method based on Bi-LSTM, CNN and CRF
WO2023108993A1 (en) Product recommendation method, apparatus and device based on deep clustering algorithm, and medium
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN112328900A (en) Deep learning recommendation method integrating scoring matrix and comment text
WO2023134083A1 (en) Text-based sentiment classification method and apparatus, and computer device and storage medium
CN113704546A (en) Video natural language text retrieval method based on space time sequence characteristics
CN110750648A (en) Text emotion classification method based on deep learning and feature fusion
WO2023159758A1 (en) Data enhancement method and apparatus, electronic device, and storage medium
CN111078833A (en) Text classification method based on neural network
CN111199155B (en) Text classification method and device
CN111666752B (en) Circuit teaching material entity relation extraction method based on keyword attention mechanism
CN112364168A (en) Public opinion classification method based on multi-attribute information fusion
CN111581392B (en) Automatic composition scoring calculation method based on statement communication degree
CN111984782A (en) Method and system for generating text abstract of Tibetan language
CN113343690A (en) Text readability automatic evaluation method and device
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant