CN111241273A - Text data classification method and device, electronic equipment and computer readable medium - Google Patents


Info

Publication number
CN111241273A
CN111241273A
Authority
CN
China
Prior art keywords
vocabulary
text
training
text data
entries
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811442384.7A
Other languages
Chinese (zh)
Inventor
吴建党
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201811442384.7A
Publication of CN111241273A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text data classification method, a text data classification device, an electronic device, and a computer-readable medium, in the field of computer information processing. The method comprises the following steps: preprocessing text data to be processed to generate a plurality of entries; determining a word vector for each of the entries through a vocabulary dictionary; constructing a text entry vector matrix from the plurality of entries and their corresponding word vectors; and determining the category label of the text data to be processed from the text entry vector matrix and a text classification model. The text data classification method, the text data classification device, the electronic device, and the computer-readable medium can overcome the technical defects in the prior art and improve the accuracy and efficiency of text data classification.

Description

Text data classification method and device, electronic equipment and computer readable medium
Technical Field
The disclosure relates to the field of computer information processing, and in particular, to a text data classification method and device, an electronic device and a computer readable medium.
Background
The text classification problem is a problem of automatically classifying and marking a text set according to a certain classification system or standard. The text classification problem is a classic problem in the field of natural language processing, and is rapidly developed along with the development of statistical learning methods. The text classification problem can be split into two parts, namely a feature engineering part and a classifier part.
A common text classification method extracts features through word segmentation and builds a classification model with a machine learning method to accomplish the text classification task. A common text representation is the word vector model, whose basic idea is to represent each word as an n-dimensional dense continuous real-valued vector. Text data is thereby changed from a high-dimensional sparse form that neural networks handle poorly into continuous dense data similar to images and audio, so that various deep learning classification algorithms can be migrated to the text domain for text classification and recognition.
Although conventional text classification methods can achieve good results on a test set, they are difficult to perform well in a production environment. A new text data classification method, device, electronic device, and computer-readable medium are therefore needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present disclosure provides a text data classification method, a text data classification device, an electronic device, and a computer readable medium, which can solve the technical defects in the prior art and improve the accuracy and efficiency of the text data classification method.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, a text data classification method is provided, which includes: preprocessing text data to be processed to generate a plurality of entries; determining word vectors for the entries respectively through a vocabulary dictionary; constructing a text entry vector matrix through a plurality of entries and corresponding word vectors thereof; and determining the category label of the text data to be processed by the text entry vector matrix and a text classification model.
In an exemplary embodiment of the present disclosure, further comprising: training a word vector model through a plurality of historical text data to determine the vocabulary dictionary.
In an exemplary embodiment of the present disclosure, training a word vector model through a plurality of historical text data to determine the vocabulary dictionary comprises: preprocessing historical text data to generate a plurality of historical entries; inputting the plurality of historical terms into a term vector model; training the word vector model through a plurality of historical text data to obtain a training result; and generating the vocabulary dictionary through the training result.
In an exemplary embodiment of the present disclosure, preprocessing the historical text data, generating a plurality of historical terms, comprises: and preprocessing the historical text data according to a preset category to generate a plurality of historical entries.
In an exemplary embodiment of the present disclosure, further comprising: training a machine learning algorithm through training data with class labels to determine the text classification model.
In an exemplary embodiment of the present disclosure, training a machine learning algorithm with training data with class labels to determine the text classification model comprises: and training a random forest classifier through training data with class labels to determine the text classification model.
In an exemplary embodiment of the present disclosure, training a machine learning algorithm with training data with class labels to determine the text classification model comprises: generating training data through historical text data; labeling the training data; training a machine learning algorithm through the training data and the labels thereof; and generating the text classification model according to the parameters of the machine learning algorithm corresponding to the optimal training result.
In an exemplary embodiment of the present disclosure, determining word vectors for the plurality of entries respectively by the vocabulary dictionary comprises: judging whether each entry of the plurality of entries has a corresponding vocabulary in the vocabulary dictionary; and when the vocabulary entry has a corresponding vocabulary in the vocabulary dictionary, representing the vocabulary entry by the word vector corresponding to the vocabulary.
In an exemplary embodiment of the present disclosure, determining word vectors for the plurality of entries respectively by the vocabulary dictionary comprises: and when the vocabulary entry does not have a corresponding vocabulary in the vocabulary dictionary, determining a word vector corresponding to the vocabulary through the similarity between the vocabulary entry and the vocabulary in the vocabulary dictionary.
In an exemplary embodiment of the present disclosure, determining the word vector corresponding to an entry through the similarity between the entry and the vocabulary in the vocabulary dictionary includes: calculating the similarity between the entry and each vocabulary in the vocabulary dictionary; and representing the entry with the word vector of the vocabulary with the maximum similarity.
According to an aspect of the present disclosure, there is provided a text data classification apparatus including: the preprocessing module is used for preprocessing the text data to be processed to generate a plurality of entries; the word vector module is used for respectively determining word vectors for the plurality of entries through the vocabulary dictionary; the matrix module is used for constructing a text entry vector matrix through a plurality of entries and word vectors corresponding to the entries; and the label module is used for determining the category label of the text data to be processed by the text entry vector matrix and the text classification model.
According to an aspect of the present disclosure, an electronic device is provided, the electronic device including: one or more processors; storage means for storing one or more programs; when executed by one or more processors, cause the one or more processors to implement a method as above.
According to an aspect of the disclosure, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.
According to the text data classification method, the text data classification device, the electronic equipment and the computer readable medium, the technical defects in the prior art can be overcome, and the accuracy and the efficiency of the text data classification method are improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.
Fig. 1 is a system block diagram illustrating a text data classification method and apparatus according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of text data classification in accordance with an exemplary embodiment.
Fig. 3 is a flowchart illustrating a text data classification method according to another exemplary embodiment.
Fig. 4 is a flowchart illustrating a text data classification method according to another exemplary embodiment.
Fig. 5 is a block diagram illustrating a text data classification apparatus according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a text data classification apparatus according to another exemplary embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 8 is a schematic diagram illustrating a computer-readable storage medium according to an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.
The inventor finds that traditional text classification methods extract features through word segmentation and build a classification model with a machine learning method to accomplish the text classification task. How text features are represented is key to text classification accuracy. One text feature representation method is based on the vector space model and uses one-hot encoding to construct a document-entry matrix. After feature extraction, machine learning models applicable to the classification task include Bayesian models, SVM models, random forests, ensemble-based methods, and the like.
Yet another text representation is the word vector model, whose basic idea is to represent each word as an n-dimensional dense, continuous, real-valued vector. The greatest advantage of this distributed representation is its very strong characterization capability: it overcomes the discrete sparsity of one-hot encoding and also captures the connections between words well. Through the word vector representation, text data is changed from a high-dimensional sparse form that neural networks handle poorly into continuous dense data similar to images and audio, so that various deep learning classification algorithms can be migrated to the text domain.
The prior art has solved the current text classification problem to some extent, but there are still many disadvantages and places to improve.
First, conventional text classification methods, while able to achieve good performance on a test set, are difficult to perform well in a production environment. There are two main reasons: (1) language itself has many modes of expression, and one-hot feature representation cannot express synonymy and polysemy; its assumption that feature items are mutually independent is clearly inconsistent with reality; (2) the text to be predicted may contain words that do not appear in the training samples, so a dictionary constructed from the training samples cannot fully cover the words of the predicted text, and the features of the predicted text are lost to some extent.
Text classification based on the word vector model adopts deep learning algorithms and greatly improves the classification effect, but deep learning algorithms depend on a large amount of manually labeled training data. A large amount of high-quality labeled data is difficult to obtain in a real environment, and the magnitude of available training data is then insufficient for the numerous parameters of a deep neural network model to fully converge. Furthermore, deep learning requires a model structure of sufficient depth, usually more than 3 hidden layers and sometimes as many as 10. Such a multi-layer nonlinear mapping structure transforms the input samples layer by layer into a new feature space in which classification or prediction may be easier, but because the network's expressive capability is so strong, overfitting may occur.
In view of the above, the application provides a text data classification method and device, and the problems that a traditional machine learning model is poor in formal online performance of text classification, a deep learning model excessively depends on manual labeling data, and overfitting is easy to occur can be solved. Meanwhile, the unlabeled data is used for assisting in labeling the data training model and predicting the label value of the unlabeled data.
The content of the present application will be described in detail with reference to specific examples below:
fig. 1 is a system block diagram illustrating a text data classification method and apparatus according to an exemplary embodiment.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background server supporting shopping websites browsed by users using the terminal devices 101, 102, 103. The server 105 may analyze received text data such as product evaluations and feed back the processing result (e.g., whether an evaluation is positive or negative) to the relevant user.
The server 105 may, for example, pre-process the text data to be processed, generating a plurality of entries; the server 105 may determine word vectors for the plurality of terms, respectively, e.g., via a vocabulary dictionary; the server 105 may construct a text entry vector matrix, for example, with a plurality of entries and their corresponding word vectors; the server 105 may determine the class label of the text data to be processed, for example, by using the text entry vector matrix and a text classification model.
The server 105 may also train a word vector model through a plurality of historical text data to determine the vocabulary dictionary; the server 105 may train a machine learning algorithm, for example, through training data with class labels, to determine the text classification model.
The server 105 may be a single entity server, or may be composed of a plurality of servers. It should be noted that the text data classification method provided by the embodiments of the present disclosure may be executed by the server 105, and accordingly, the text data classification apparatus may be disposed in the server 105. The web page end for browsing goods and the data collection end for merchant evaluations are generally located in the terminal devices 101, 102, 103.
FIG. 2 is a flow diagram illustrating a method of text data classification in accordance with an exemplary embodiment. The text data classification method 20 includes at least steps S202 to S208.
As shown in fig. 2, in S202, the text data to be processed is preprocessed to generate a plurality of entries. Preprocessing may include word segmentation, stop-word removal, low-frequency word removal, and the like. After preprocessing, all entries are joined with spaces.
In one embodiment, the word segmentation process may include Chinese word segmentation, which is the splitting of a sequence of Chinese characters into individual words. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. Existing word segmentation algorithms fall into three major categories: methods based on character-string matching, methods based on understanding, and methods based on statistics. Depending on whether segmentation is combined with part-of-speech tagging, methods can further be divided into pure word segmentation methods and integrated methods that combine segmentation and tagging. In the embodiments of the present application, the text data to be processed may be segmented by one or a combination of the above word segmentation methods.
In one embodiment, stop-word removal may be performed. In information retrieval, to save storage space and improve search efficiency, certain words are automatically filtered before or after processing natural language data or text; these are called stop words (Stop Words). Stop words are manually curated rather than automatically generated. They broadly fall into two categories: overly common words and function words. Overly common words are so widely used that they cannot guarantee meaningful results in data processing and do little to narrow the processing scope. Function words such as auxiliary words, adverbs, prepositions, and conjunctions usually have no definite meaning on their own and only serve a role within a complete sentence. Stop-word removal eliminates such unhelpful vocabulary.
In one embodiment, low-frequency words are removed, i.e., uncommon words are filtered out to aid subsequent processing.
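The preprocessing steps above (stop-word removal, low-frequency word removal, joining entries with spaces) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the text has already been segmented into tokens (e.g., by a Chinese word segmenter), and the stop-word list and frequency threshold are arbitrary example values.

```python
from collections import Counter

def preprocess(tokens, stop_words, min_freq=2):
    """Drop stop words and low-frequency tokens, then join the kept entries with spaces."""
    counts = Counter(tokens)
    kept = [t for t in tokens if t not in stop_words and counts[t] >= min_freq]
    return " ".join(kept)

# Example: pre-segmented review tokens; "的" and "很" are stop words,
# "包装" appears only once and is dropped as a low-frequency word.
tokens = ["质量", "的", "很", "好", "质量", "好", "包装", "好"]
print(preprocess(tokens, stop_words={"的", "很"}))  # → 质量 好 质量 好 好
```

In practice the stop-word list would be loaded from a curated file and the frequency threshold tuned for the corpus.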
In S204, word vectors are determined for the plurality of entries through the vocabulary dictionary. The vocabulary dictionary may be, for example, one generated with gensim and fastText. gensim is a tool for mining the semantic structure of documents by measuring phrase patterns. fastText is an open-source word vector computation and text classification tool; in text classification tasks, fastText (a shallow network) can reach accuracy comparable to deep networks while training many orders of magnitude faster.
In one embodiment, a vocabulary dictionary of word vectors may be constructed, for example, by the Word2vec method in gensim. Word2vec is used to produce word vector models and may be a two-layer neural network trained to reconstruct the linguistic contexts of words; under the bag-of-words assumption in word2vec, the order of words is unimportant. After training is complete, the word2vec model can map each word to a vector that represents word-to-word relationships; this vector is taken from the hidden layer of the neural network.
In one embodiment, the word2vec word vector model may be trained over a plurality of historical text data to determine the vocabulary dictionary. For example, the historical text data is preprocessed to generate a plurality of historical entries; inputting the plurality of historical terms into a term vector model; training the word vector model through a plurality of historical text data to obtain a training result; and generating the vocabulary dictionary through the training result.
In one embodiment, determining a word vector for each of the plurality of terms from the vocabulary dictionary comprises: judging whether each entry of the plurality of entries has a corresponding vocabulary in the vocabulary dictionary; and when the vocabulary entry has a corresponding vocabulary in the vocabulary dictionary, representing the vocabulary entry by the word vector corresponding to the vocabulary.
In one embodiment, determining a word vector for each of the plurality of terms from the vocabulary dictionary comprises: and when the vocabulary entry does not have a corresponding vocabulary in the vocabulary dictionary, determining a word vector corresponding to the vocabulary through the similarity between the vocabulary entry and the vocabulary in the vocabulary dictionary. Determining word vectors corresponding to the vocabulary through the similarity between the vocabulary entry and the vocabulary in the vocabulary dictionary comprises the following steps: calculating the similarity between the entry and each vocabulary in the vocabulary dictionary; and representing the vocabulary with the word vector of the vocabulary with the maximum similarity.
In one embodiment, the cosine similarity between entry A and each vocabulary in the vocabulary dictionary (vocabularies B, C, D, E) may be calculated; the vocabulary in the dictionary with the highest cosine similarity to the entry (say, D) is taken as the entry's corresponding vocabulary, and entry A is then represented by the word vector of vocabulary D. Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them: the vectors are placed in the vector space according to their coordinate values, the included angle is obtained, and the cosine of that angle represents the similarity of the two vectors.
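The nearest-vocabulary lookup described above can be sketched as follows. This is an illustration under one assumption: the out-of-dictionary entry can be given an approximate vector (e.g., via fastText's subword vectors), so that cosine similarity against each dictionary word is computable. The vectors and labels are made-up toy values.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_word(entry_vec, vocab_dict):
    """Return the dictionary word whose vector is most cosine-similar to entry_vec."""
    return max(vocab_dict, key=lambda w: cosine(entry_vec, vocab_dict[w]))

# Toy dictionary: the entry's approximate vector is closest in direction to D.
vocab_dict = {"B": [1.0, 0.0], "C": [0.0, 1.0], "D": [0.6, 0.8]}
print(nearest_word([0.5, 0.9], vocab_dict))  # → D
```

The entry is then represented by `vocab_dict[nearest_word(...)]`, as the embodiment describes.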
In S206, a text entry vector matrix is constructed from a plurality of entries and their corresponding word vectors. And corresponding each entry to the corresponding word vector one by one to generate an entry vector matrix.
In one embodiment, for each entry not covered by the dictionary, the most similar in-dictionary entry is found based on the word vectors and substituted for it; the prediction samples are then recombined, text entry features are constructed, and the entry vector matrix is generated.
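Building the entry vector matrix can be sketched as stacking each entry's word vector into one row per entry. The dictionary below is an illustrative stand-in for a trained word2vec dictionary, and the sketch assumes out-of-dictionary entries have already been replaced by their nearest in-dictionary entries as described above.

```python
import numpy as np

# Illustrative vocabulary dictionary (in practice: trained word2vec vectors).
vocab_dict = {
    "screen":  np.array([0.1, 0.9]),
    "good":    np.array([0.8, 0.2]),
    "battery": np.array([0.5, 0.5]),
}

def entry_matrix(entries, vocab_dict):
    """Stack each entry's word vector into an (n_entries, dim) matrix."""
    return np.stack([vocab_dict[e] for e in entries])

m = entry_matrix(["screen", "good"], vocab_dict)
print(m.shape)  # → (2, 2)
```

Each row corresponds one-to-one with an entry, matching the entry-to-vector correspondence described in S206.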
In S208, the category label of the text data to be processed is determined by using the text entry vector matrix and a text classification model. The text classification model may be, for example, a random forest classification model.
A random forest is an algorithm that integrates multiple trees following the idea of ensemble learning; its basic unit is the decision tree. In essence, the random forest algorithm is a classifier ensemble algorithm based on decision trees, where each tree depends on a random vector and all the random vectors are independently and identically distributed. A random forest randomizes the column variables and row observations of the data set to generate multiple classification trees, and finally aggregates the classification tree results. Compared with a neural network, a random forest reduces computation while improving prediction accuracy; the algorithm is insensitive to multicollinearity and is relatively robust to missing and unbalanced data, so it adapts well to data sets with thousands of explanatory variables.
In one embodiment, a random forest classification model is trained with training data with class labels to determine the text classification model. Can include the following steps: generating training data through historical text data; labeling the training data; training a machine learning algorithm through the training data and the labels thereof; and generating the text classification model according to the parameters of the machine learning algorithm corresponding to the optimal training result.
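Training a random forest classifier on labeled feature vectors can be sketched with scikit-learn as below. The features and labels are synthetic stand-ins: a plausible (but assumed, not patent-specified) document feature would be, e.g., an aggregate of each text's entry vector matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic, clearly separable document features for two class labels.
X_pos = rng.normal(loc=1.0, scale=0.3, size=(20, 4))    # label 1, e.g., positive reviews
X_neg = rng.normal(loc=-1.0, scale=0.3, size=(20, 4))   # label 0, e.g., negative reviews
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [0] * 20)

# Train the random forest; n_estimators is an illustrative setting.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
print(clf.predict([[1.0, 1.0, 1.0, 1.0]]))  # → [1]
```

In the patent's flow, the fitted model's parameters corresponding to the best training result would constitute the text classification model used in S208.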
According to the text data classification method disclosed by the invention, word vectors are respectively determined for the plurality of entries through the vocabulary dictionary, and the category labels of the text data to be processed are determined through the word vectors corresponding to the entries and the text classification model, so that the technical defects in the prior art can be overcome, and the accuracy and the efficiency of the text data classification method are improved.
According to the text data classification method of the present disclosure, the text classification task is realized with a machine learning classification model, and the text is represented using a traditional vector space model; entries in a prediction sample that are not covered by the training-sample dictionary are then replaced by the most similar entries in the dictionary, obtained through similarity measurement. Loss of the prediction sample's feature information is minimized, improving the usability and stability of the machine learning classification model's predictions in a production environment.
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Fig. 3 is a flowchart illustrating a text data classification method according to another exemplary embodiment. The text data classification method 30 shown in fig. 3 is a detailed description of the construction of the vocabulary dictionary in S204 "determining the word vectors for the plurality of entries respectively through the vocabulary dictionary" in the flow shown in fig. 2.
As shown in fig. 3, in S302, historical text data is preprocessed to generate a plurality of historical entries. For example, the historical text data may be preprocessed according to preset categories: historical commodity evaluation data is sampled uniformly by category, and training data is generated after preprocessing.
In one embodiment, the preprocessing includes word segmentation, stop-word removal, modal-particle removal, and the like, which are not limited in this application.
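A minimal sketch of this preprocessing step, assuming whitespace tokenization in place of a real segmenter (for Chinese text a segmenter such as jieba would be used) and an illustrative stop-word list:

```python
STOPWORDS = {"the", "a", "is", "of"}  # illustrative stop-word list

def preprocess(text):
    """Tokenize, lowercase, and drop stop words to produce entries."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

entries = preprocess("The battery life of this phone is great")
# downstream vectorizers that expect strings get the entries re-joined
joined = " ".join(entries)
```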
In S304, the plurality of historical entries are input into a word vector model; the training data may be loaded into gensim or fastText, for example. fastText is a fast text classifier that provides a simple and efficient method for text classification and representation learning; it consists of two parts, a text classifier and a word vector learner. Word2vec is an efficient tool for representing words as real-valued vectors; its models include CBOW (Continuous Bag-of-Words) and Skip-Gram. Through training, word2vec maps text content to vectors in a K-dimensional vector space, where similarity in the vector space represents the semantic similarity of the text.
In S306, the word vector model is trained on the plurality of historical text data to obtain a training result. For example, a Word2vec model is trained after parameter tuning, entry vectors are calculated, and the Word2vec model result and the word vectors are saved once training completes.
In S308, the vocabulary dictionary is generated by the training result.
In the text data classification method of the present disclosure, word vectors are trained with a fastText or word2vec model. For each entry in a prediction sample that does not appear in the training dictionary, the closest entry in the dictionary is found by similarity measurement and substituted for it. This reconstructs and optimizes the prediction-sample feature data and retains the classified text's feature information to the maximum extent.
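The replacement of out-of-dictionary entries can be sketched as follows; the word-vector dictionary and the vector values are invented for illustration (real vectors would come from the trained word2vec model):

```python
import math

# toy word-vector dictionary built from training data (values assumed)
VOCAB = {"phone": [1.0, 0.1], "laptop": [0.8, 0.4], "refund": [0.0, 1.0]}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def map_entry(entry, entry_vec):
    """Keep an in-dictionary entry as-is; replace an out-of-dictionary
    entry with the most similar dictionary entry, preserving features."""
    if entry in VOCAB:
        return entry
    return max(VOCAB, key=lambda w: cosine(VOCAB[w], entry_vec))

# "tablet" is unseen in training; its (assumed) vector is nearest "phone"
replacement = map_entry("tablet", [0.95, 0.15])
```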
Fig. 4 is a flowchart illustrating a text data classification method according to another exemplary embodiment. The text data classification method 40 shown in fig. 4 details the text classification model used in S208 of fig. 2 ("determine the category label of the text data to be processed by using the text entry vector matrix and the text classification model").
As shown in fig. 4, in S402, training data is generated from the historical text data.
In S404, the training data is labeled.
In S406, a machine learning algorithm is trained by the training data and its labels.
In S408, the text classification model is generated according to the parameters of the machine learning algorithm corresponding to the optimal training result.
For example: historical data is sampled uniformly by category to generate training data, and the training samples are labeled manually. A domain dictionary is collected during data preprocessing (word segmentation, stop-word removal, modal-particle removal, and the like); a CountVectorizer generates the term frequency of each entry for each sample, converting the training data into text-entry matrix data. A machine learning model (a random forest classifier) is then trained on this data, and its parameters are tuned until the classification effect meets the online requirement.
In the text data classification method of the present disclosure, an unlabeled data set is used to train word vectors, and a text optimization and conversion module is added to assist the machine learning classification model in predicting labels, extending the prediction capability of the classification model.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as computer programs executed by a CPU. When executed by the CPU, these programs perform the functions defined by the methods provided by the present disclosure. The programs may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 5 is a block diagram illustrating a text data classification apparatus according to an exemplary embodiment. As shown in fig. 5, the text data classification device 50 includes: a preprocessing module 502, a word vector module 504, a matrix module 506, and a label module 508.
The preprocessing module 502 is configured to preprocess text data to be processed to generate a plurality of entries. The preprocessing can include word segmentation, stop-word removal, low-frequency word removal, and the like. After preprocessing, the entries are joined with spaces.
The word vector module 504 is configured to determine word vectors for the plurality of entries through the vocabulary dictionary; the vocabulary dictionary may, for example, be generated by gensim or fastText. gensim is a tool for mining the semantic structure of documents by measuring phrase patterns. fastText is an open-source word vector computation and text classification tool; on text classification tasks, fastText (a shallow network) can reach accuracy comparable to deep networks while training orders of magnitude faster.
The matrix module 506 is configured to construct a text entry vector matrix by using a plurality of entries and word vectors corresponding to the entries; and
the label module 508 is configured to determine the category label of the text data to be processed from the text entry vector matrix and a text classification model. The text classification model may be, for example, a random forest classification model.
In the text data classification device of the present disclosure, word vectors are determined for the plurality of entries through the vocabulary dictionary and input into the text classification model, which then determines the category label of the text data to be processed. This overcomes the technical defects of the prior art and improves both the accuracy and the efficiency of text data classification.
Fig. 6 is a block diagram illustrating a text data classification apparatus according to another exemplary embodiment. The text data classification device 60 further includes, in addition to the text data classification device 50: word vector training module 602, text classification training module 604.
The word vector training module 602 is configured to train a word vector model on a plurality of historical text data to determine the vocabulary dictionary. This can include: preprocessing historical text data to generate a plurality of historical entries; inputting the plurality of historical entries into a word vector model; training the word vector model on the plurality of historical text data to obtain a training result; and generating the vocabulary dictionary from the training result.
The text classification training module 604 is configured to train a machine learning algorithm with class-labeled training data to determine the text classification model. This can include: generating training data from historical text data; labeling the training data; training a machine learning algorithm with the training data and its labels; and generating the text classification model from the parameters of the machine learning algorithm that yield the optimal training result.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 200 according to this embodiment of the present disclosure is described below with reference to fig. 7. The electronic device 200 shown in fig. 7 is only an example and should not limit the functions or scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
The storage unit stores program code executable by the processing unit 210 to cause the processing unit 210 to perform the steps according to various exemplary embodiments of the present disclosure described in the method sections of this specification above. For example, the processing unit 210 may perform the steps shown in figs. 2, 3, and 4.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiments of the present disclosure.
Fig. 8 schematically illustrates a computer-readable storage medium in an exemplary embodiment of the disclosure.
Referring to fig. 8, a program product 400 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++, or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: preprocessing text data to be processed to generate a plurality of entries; determining word vectors for the entries respectively through a vocabulary dictionary; constructing a text entry vector matrix through the plurality of entries and their corresponding word vectors; and determining the category label of the text data to be processed through the text entry vector matrix and a text classification model.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus as described in the embodiments, or may be located, with corresponding changes, in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that the present disclosure is not limited to the precise arrangements, instrumentalities, or instrumentalities described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (13)

1. A method of classifying text data, comprising:
preprocessing text data to be processed to generate a plurality of entries;
determining word vectors for the entries respectively through a vocabulary dictionary;
constructing a text entry vector matrix through a plurality of entries and corresponding word vectors thereof; and
determining the category label of the text data to be processed through the text entry vector matrix and a text classification model.
2. The method of claim 1, further comprising:
training a word vector model through a plurality of historical text data to determine the vocabulary dictionary.
3. The method of claim 2, wherein training a word vector model through a plurality of historical text data to determine the vocabulary dictionary comprises:
preprocessing historical text data to generate a plurality of historical entries;
inputting the plurality of historical terms into a term vector model;
training the word vector model through a plurality of historical text data to obtain a training result; and
generating the vocabulary dictionary according to the training result.
4. The method of claim 3, wherein preprocessing the historical text data to generate a plurality of historical terms comprises:
preprocessing the historical text data according to a preset category to generate the plurality of historical entries.
5. The method of claim 1, further comprising:
training a machine learning algorithm through training data with class labels to determine the text classification model.
6. The method of claim 5, wherein training a machine learning algorithm with training data with class labels to determine the text classification model comprises:
training a random forest classifier through training data with class labels to determine the text classification model.
7. The method of claim 5, wherein training a machine learning algorithm with training data with class labels to determine the text classification model comprises:
generating training data through historical text data;
labeling the training data;
training a machine learning algorithm through the training data and the labels thereof; and
generating the text classification model according to the parameters of the machine learning algorithm corresponding to the optimal training result.
8. The method of claim 1, wherein determining a word vector for each of the plurality of terms via a vocabulary dictionary comprises:
judging whether each entry of the plurality of entries has a corresponding vocabulary in the vocabulary dictionary; and
when the entry has a corresponding vocabulary in the vocabulary dictionary, representing the entry by the word vector corresponding to the vocabulary.
9. The method of claim 8, wherein determining a word vector for each of the plurality of terms via a vocabulary dictionary comprises:
when the entry does not have a corresponding vocabulary in the vocabulary dictionary, determining a word vector for the entry through the similarity between the entry and the vocabularies in the vocabulary dictionary.
10. The method of claim 9, wherein determining the word vector through the similarity between the entry and the vocabularies in the vocabulary dictionary comprises:
calculating the similarity between the entry and each vocabulary in the vocabulary dictionary; and
representing the entry by the word vector of the vocabulary with the largest similarity.
11. A text data classification apparatus, comprising:
the preprocessing module is used for preprocessing the text data to be processed to generate a plurality of entries;
the word vector module is used for respectively determining word vectors for the plurality of entries through the vocabulary dictionary;
the matrix module is used for constructing a text entry vector matrix through a plurality of entries and word vectors corresponding to the entries; and
and the label module is used for determining the category label of the text data to be processed by the text entry vector matrix and a text classification model.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
13. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-10.
CN201811442384.7A 2018-11-29 2018-11-29 Text data classification method and device, electronic equipment and computer readable medium Pending CN111241273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811442384.7A CN111241273A (en) 2018-11-29 2018-11-29 Text data classification method and device, electronic equipment and computer readable medium


Publications (1)

Publication Number Publication Date
CN111241273A true CN111241273A (en) 2020-06-05

Family

ID=70864818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811442384.7A Pending CN111241273A (en) 2018-11-29 2018-11-29 Text data classification method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN111241273A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743077A (en) * 2020-08-14 2021-12-03 北京京东振世信息技术有限公司 Method and device for determining text similarity
CN113743077B (en) * 2020-08-14 2023-09-29 北京京东振世信息技术有限公司 Method and device for determining text similarity
CN113114679A (en) * 2021-04-13 2021-07-13 中国工商银行股份有限公司 Message identification method and device, electronic equipment and medium
CN115310564A (en) * 2022-10-11 2022-11-08 北京睿企信息科技有限公司 Classification label updating method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination