CN111767403A

CN111767403A - Text classification method and device

Info

Publication number: CN111767403A
Application number: CN202010644879.9A
Authority: CN
Inventors: 刘志煌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-07-07
Filing date: 2020-07-07
Publication date: 2020-10-13
Anticipated expiration: 2040-07-07
Also published as: CN111767403B

Abstract

The embodiments of the application disclose a text classification method and a text classification device, relating to the field of big data and to natural language processing in the field of artificial intelligence. The method includes: acquiring a text to be classified and a word bank for text classification; performing word segmentation on the text to be classified to obtain a plurality of text words and word sequence information of the text words; determining target category keywords existing in the text to be classified according to the text words and the category keywords; determining a target text word associated with the target category keyword from the text to be classified based on the word sequence information of the text words and the target category keyword; classifying the target text words corresponding to the target category keywords on a preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of the target category keywords; and integrating the classification result of each target category keyword to obtain the category of the text to be classified. The scheme can improve the accuracy of the classification result of text classification.

Description

Text classification method and device

Technical Field

The application relates to the field of data processing, in particular to a text classification method and device.

Background

With the development of technology, the information available on the internet has become increasingly extensive, and the volume of data has become correspondingly large. To acquire the target data actually required more efficiently and quickly, massive data needs to be processed; for example, text data can be classified. In the prior art, text data can be classified by performing a keyword search on it, after which inappropriate parts of the text data can be removed. In the course of research on and practice with the prior art, the inventor of the application found that the classification results obtained by the prior art have low accuracy.

Disclosure of Invention

The embodiment of the application provides a text classification method and device, which can improve the accuracy of classification results of text classification.

The embodiment of the application provides a text classification method, which comprises the following steps:

acquiring a text to be classified and a word bank for text classification, wherein the word bank comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories;

performing word segmentation on the text to be classified to obtain a plurality of text words and word sequence information of the text words;

determining target category keywords existing in the text to be classified according to the text words and the category keywords;

determining a target text word associated with the target category keyword from the text to be classified based on the word sequence information of the text word and the target category keyword;

classifying the target text words corresponding to the target category keywords on the preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of the target category keywords;

and integrating the classification result of each target class keyword to obtain the class of the text to be classified.

Correspondingly, an embodiment of the present application provides a text classification apparatus, including:

the acquisition module is used for acquiring a text to be classified and a word bank for text classification, wherein the word bank comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories;

the word segmentation module is used for segmenting the text to be classified to obtain a plurality of text words and word sequence information of the text words;

the first determining module is used for determining target category keywords existing in the text to be classified according to the text words and the category keywords;

the second determining module is used for determining a target text word associated with the target category keyword from the text to be classified based on the word sequence information of the text word and the target category keyword;

the classification module is used for classifying the target text words corresponding to the target category keywords on the preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of the target category keywords;

and the integration module is used for integrating the classification result of each target class keyword to obtain the class of the text to be classified.

In some embodiments of the present application, the classification module includes a classification sub-module and an integration sub-module, wherein,

the classification sub-module is used for classifying each target text word corresponding to the target category keywords on the preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of each target text word;

and the integration sub-module is used for integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword.

In some embodiments of the present application, the classification submodule comprises a statistical unit and a calculation unit, wherein,

the statistical unit is used for respectively counting the occurrence frequency of the positive characteristic words and the negative characteristic words in all the text words to obtain the positive word frequency and the negative word frequency of the target category keywords in the preset category;

and the calculating unit is used for carrying out classification calculation on each target text word corresponding to the target category keywords based on the positive word frequency and the negative word frequency to obtain a classification result of each target text word.

In some embodiments of the present application, the integration sub-module comprises a counting unit and a determining unit, wherein,

the counting unit is used for respectively counting the target text words of which the classification results are positive categories and negative categories to obtain positive quantity and negative quantity;

and the determining unit is used for determining the classification result of the target category keywords based on the positive quantity and the negative quantity.

In some embodiments of the present application, the determining unit is specifically configured to:

when the positive quantity is larger than the negative quantity, determining that the classification result of the target category keyword is a positive category;

and when the positive quantity is smaller than the negative quantity, determining that the classification result of the target category keyword is a negative category.
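The decision rule of the determining unit can be sketched as a simple majority vote. This is a minimal illustration only: the function name `keyword_result` and the label strings are hypothetical, and because the text above does not state what happens when the two quantities are equal, the sketch returns `None` in that case.

```python
def keyword_result(target_word_results):
    """Majority vote over the per-word results of one target category keyword.

    `target_word_results` is a list of labels, each "positive" or "negative"
    (label names are an assumption made for this sketch).
    """
    positive = sum(1 for r in target_word_results if r == "positive")
    negative = sum(1 for r in target_word_results if r == "negative")
    if positive > negative:
        return "positive"
    if positive < negative:
        return "negative"
    return None  # tie: behaviour not specified in the text above (assumption)
```

For instance, two positive target text words and one negative target text word would give the keyword a positive result under this rule.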

In some embodiments of the present application, the text classification apparatus further includes:

the data acquisition module is used for acquiring a class reference word of a preset class and sample data;

the expansion module is used for performing near sense expansion on the category reference words to obtain category keywords of the preset categories;

and the processing module is used for processing the sample data based on the category keywords and determining the positive characteristic words and the negative characteristic words of the preset categories to obtain a word bank, wherein the word bank comprises the category keywords, the positive characteristic words and the negative characteristic words corresponding to the preset categories.

In some embodiments of the present application, the processing module is specifically configured to:

based on the category keywords, dividing the sample data to obtain positive sample data and negative sample data;

and respectively mining the feature words of the positive sample data and the negative sample data based on a preset threshold and the category keywords, and determining the positive feature words and the negative feature words of the preset category.

In some embodiments of the present application, the word bank includes category keywords, positive feature words, and negative feature words corresponding to a plurality of preset categories, and the classification module is specifically configured to:

classifying the target text words corresponding to the target category keywords on each preset category based on the positive feature words and the negative feature words of each preset category to obtain a classification result of the target category keywords on each preset category;

in this case, the integration module includes an integration submodule, wherein,

and the integration submodule is used for integrating the classification result of each target class keyword on each preset class and determining the class of the text to be classified.

In some embodiments of the present application, each of the preset categories includes a positive sub-category and a negative sub-category, and the integration sub-module is specifically configured to:

determining the subcategory of the text to be classified in each preset category according to the subcategory of each target category keyword in each preset category;

and integrating sub-categories of the text to be classified on all preset categories to obtain the category of the text to be classified.

Correspondingly, the embodiment of the present application further provides a storage medium, where a computer program is stored, and the computer program is suitable for being loaded by a processor to execute any one of the text classification methods provided in the embodiment of the present application.

Accordingly, embodiments of the present application further provide a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements any one of the text classification methods provided in the embodiments of the present application when executing the computer program.

In the embodiments of the application, target category keywords in a text to be classified are determined through the category keywords of preset categories in a word bank. Target text words associated with each target category keyword are then determined through the word sequence information of the text words obtained by segmenting the text to be classified, and the target text words are classified. The classification result of each target category keyword is determined from the classification results of its target text words, and finally the category of the text to be classified is determined based on the classification result of each target category keyword in the text. In this way, the accuracy of the classification result of text classification can be improved.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic scene diagram of a text classification apparatus provided in an embodiment of the present application;

Fig. 2 is a schematic flowchart of a text classification method provided in an embodiment of the present application;

Fig. 3 is a schematic diagram of an application scene effect before a text classification method according to an embodiment of the present application is applied;

Fig. 4 is a schematic diagram of an application scene effect after a text classification method according to an embodiment of the present application is applied;

Fig. 5 is another schematic flowchart of a text classification method provided in an embodiment of the present application;

Fig. 6 is a schematic flowchart of a spam text classification method according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present application;

Fig. 8 is another schematic structural diagram of a text classification apparatus according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the embodiments described in the present application are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The text classification method provided by the embodiment of the application relates to the field of artificial intelligence, in particular to the field of machine learning in the field of artificial intelligence.

Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

The processes of near-meaning expansion, word segmentation and the like in the embodiment of the application involve natural language processing and other technologies in the field of artificial intelligence, and can be completed through natural language processing techniques; the specific contents are explained through the following embodiments.

The embodiment of the application provides a text classification method and device. Specifically, the method may be implemented in a text classification device, and the text classification device may be integrated in a computer device. The computer device may be an electronic device such as a terminal, for example a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, or a smart watch, but is not limited thereto. The computer device may also be an electronic device such as a server, which may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the application.

Fig. 1 is a scene schematic diagram of a text classification device according to an embodiment of the present application. The terminal can acquire a text to be classified and a word bank for text classification, wherein the word bank comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories; perform word segmentation on the text to be classified to obtain a plurality of text words and word sequence information of the text words; determine target category keywords existing in the text to be classified according to the text words and the category keywords; determine a target text word associated with the target category keyword from the text to be classified based on the word sequence information of the text words and the target category keyword; classify the target text words corresponding to the target category keywords on a preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of the target category keywords; and integrate the classification result of each target category keyword to obtain the category of the text to be classified.

The text classification process can be completed through cooperation of the terminal and the server, for example, the terminal can transmit a text to be classified input by a user to the server, the server can receive the text to be classified transmitted by the terminal and acquire a word bank for text classification, then the server can complete the text classification process through the text classification method, obtain the category of the text to be classified, and trigger further operation of the terminal and the server based on the category.

It should be noted that the scene schematic diagram of the text classification device shown in Fig. 1 is merely an example. The text classification device and the scene described in the embodiment of the present application are intended to illustrate the technical solution of the embodiment more clearly, and do not limit the technical solution provided in the embodiment of the present application.

Details are described below.

This embodiment is described from the perspective of a text classification device, which may be specifically integrated in a terminal equipped with a storage unit and a microprocessor, such as a camera, a video camera, a smart phone, a tablet computer, a notebook computer, a personal computer, or a wearable smart device.

Fig. 2 is a schematic flowchart of a text classification method according to an embodiment of the present application. The text classification method can comprise the following steps:

101. Acquire a text to be classified and a word bank for text classification, wherein the word bank comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories.

The text to be classified may include text data to be classified. The text data may be edited by a user; for example, the user may manually input data information including the text data in an edit box of an application client. The text data may also be automatically generated by a particular system based on a particular function, such as a text generation system (e.g., a marketing-account text generator, a junk-text generator, etc.) that automatically generates a preset number of words on a particular topic. The text to be classified may be a sentence or a paragraph, and may include basic elements such as punctuation marks, operators, numbers, letters, and words (such as words in different languages like Chinese and English).

A preset category may be summary information describing text data with similar content. There may be multiple preset categories, such as evaluation categories (e.g., movie evaluation, food evaluation, etc.), illegal categories (e.g., gambling, pornography, drug information, etc.), advertisements, and so on. For example, the category keywords of advertisement texts may include "remuneration", "rock-bottom price", "fakes", etc.; the category keywords of food evaluation texts may include "deliciousness", "fruits", "braising", etc.

A category keyword may be a word or a combination of words whose usage frequency is greater than a preset threshold when many people describe content of a preset category. Category keywords are important for achieving efficient and accurate text classification, and they may be determined in a plurality of ways. For example, word segmentation may be performed on text data samples of the preset categories, word frequency statistics may be computed over all words obtained by the segmentation, and the words whose frequency is greater than a preset word frequency may then be determined as the category keywords of the corresponding preset category.
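The word-frequency approach described above can be sketched as follows. This is a minimal illustration that assumes the samples of one preset category have already been segmented into word lists; the function name `frequent_words` and the parameter `min_frequency` are hypothetical.

```python
from collections import Counter


def frequent_words(segmented_samples, min_frequency):
    """Return the words whose total count exceeds `min_frequency`.

    `segmented_samples` is a list of word lists, one per text sample of a
    single preset category (segmentation is assumed to be done already).
    """
    counts = Counter(word for sample in segmented_samples for word in sample)
    return {word for word, n in counts.items() if n > min_frequency}
```

With the samples `[["discount", "brand", "shoes"], ["discount", "shoes"], ["discount"]]` and `min_frequency=1`, the candidate keywords would be "discount" and "shoes".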

For another example, the category keywords may be determined manually by a developer or a user. Manual determination may take place before text classification starts or during the text classification process, and relevant personnel, including the developer or the user, may add or delete category keywords according to actual needs. Adjustable category keywords make the scheme more flexible and allow it to be adapted in time to different scenes, thereby helping to ensure high accuracy of the classification result.

The category keywords may be high-frequency words describing a preset category, but the presence of a category keyword in text data does not by itself mean that the text data belongs to the preset category. To ensure high accuracy of text classification, the common feature words other than the category keywords in the text data need to be collected and used as feature words of the preset category. The feature words may include positive feature words and negative feature words, which respectively represent the common feature words of text data that belongs to the preset category and of text data that does not belong to the preset category. Like the category keywords, the positive feature words and the negative feature words may be words, or combinations of characters (including letters, characters, numbers, symbols and the like) that carry a specific meaning, such as "a goods", "poplar woods", "+ V" and the like.

The positive characteristic words and the negative characteristic words can be determined from text data samples of the preset categories, and the determination can be made in various ways. For example, they can be determined manually by relevant personnel, which is suitable when the number of text data samples is small or when high-quality, accurate feature words cannot be obtained in other ways; they can be determined through a trained neural network model; or feature word mining can be performed on the text data samples by an algorithm, and so on. The method for determining the feature words can be selected flexibly according to the characteristics of the text data samples and the actual requirements, and is not described in detail here.

For example, the preset category may be advertisement, and the text to be classified may be "78666 with one-to-many-quarter high imitation complete brand satellite discount same number". The category keywords of the advertisement category in the word bank may be "78666 with high imitation, many-quarter, inspection, original factory, tail sheet, WeChat, satellite, V letter", the positive characteristic words of the advertisement category may be "original factory, real shoot, consult, detail, Putian, poison, recognition, tiger pounding", and the negative characteristic words of the advertisement category may be "reject, trade, illegal, rampant, industrial, severe, infringement".
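One possible in-memory layout for such a word bank is sketched below. The class name `CategoryLexicon`, its field names, and the dictionary `word_bank` are hypothetical; the entries are a small subset of the example words above and serve only to illustrate the three kinds of words stored per preset category.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryLexicon:
    """Words stored in the word bank for one preset category (hypothetical layout)."""
    category: str
    keywords: set = field(default_factory=set)        # category keywords
    positive_words: set = field(default_factory=set)  # positive characteristic words
    negative_words: set = field(default_factory=set)  # negative characteristic words


# A word bank keyed by preset category name.
word_bank = {
    "advertisement": CategoryLexicon(
        category="advertisement",
        keywords={"high imitation", "WeChat", "satellite", "V letter"},
        positive_words={"real shoot", "consult", "detail"},
        negative_words={"illegal", "rampant", "infringement"},
    )
}
```

Keeping the three word sets separate per category makes it straightforward to hold several preset categories in the same word bank.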

In some embodiments of the present application, the text classification method may further include the steps of:

acquiring a category reference word and sample data of a preset category; performing near-sense expansion on the category reference word to obtain category keywords of the preset category; and processing the sample data based on the category keywords to determine positive characteristic words and negative characteristic words of the preset category, thereby obtaining a word bank, wherein the word bank comprises the category keywords, the positive characteristic words and the negative characteristic words corresponding to the preset category.

The category reference word may be a basic word whose usage frequency is greater than a preset threshold when describing content of the preset category; for example, "WeChat" may be a category reference word. The category reference word may then be subjected to near-sense expansion; for example, "WeChat" may be expanded to "satellite, V letter, contact, + V" and the like. The near-sense expansion may be performed in various ways: through a dictionary (such as a synonym forest), through a neural network model (such as a word vector model), or manually by relevant personnel (such as a developer or a user). After the near-sense expansion is completed, the category reference word and the words obtained by its near-sense expansion may be used as the category keywords of the preset category; for example, the category keywords of the advertisement category may include "WeChat, satellite, V letter, contact address, + V".
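The dictionary-based variant of the expansion can be sketched as a simple lookup. This is an illustration under stated assumptions: the synonym dictionary is a plain Python `dict` supplied by the caller (standing in for a resource such as a synonym forest), and the function name `expand_keywords` is hypothetical.

```python
def expand_keywords(reference_words, synonyms):
    """Return the reference words plus their dictionary synonyms, in order.

    `synonyms` maps a reference word to a list of near-sense words; it is a
    stand-in for a real synonym dictionary (assumption of this sketch).
    """
    keywords = []
    for word in reference_words:
        for candidate in [word, *synonyms.get(word, [])]:
            if candidate not in keywords:  # keep each keyword only once
                keywords.append(candidate)
    return keywords
```

For example, `expand_keywords(["WeChat"], {"WeChat": ["satellite", "V letter", "contact address", "+V"]})` yields the reference word followed by its four expansions, which together form the category keywords.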

The sample data are text data samples and may be texts related to the preset category. Processing the sample data based on the category keywords to obtain the positive feature words and the negative feature words is a key step in realizing text classification, because the positive feature words and the negative feature words are key reference information required by the text classification method. The processing can be selected flexibly according to actual conditions and is not described in detail here.

For example, a category reference word "high-imitation" of the advertisement category is obtained, and near-meaning expansion of the category reference word through a synonym forest yields "fine-imitation, a-goods, factory goods, original factory"; "high-imitation, fine-imitation, a-goods, factory goods, original factory" can then be used as the category keywords of the advertisement category. The obtained sample data can then be processed using the category keywords. The sample data may be "fake-selling and fake-selling, the industrial and commercial departments punch a lot, so-called original factory sneakers are stricken vigorously" and "when a tide shoe purchases a large brand with a satellite number 1234", from which the positive feature words and the negative feature words of the advertisement category are obtained. Because the second sample is an advertisement and the first is not, the obtained positive feature words can be "discount, brand", and the obtained negative feature words can be "rampant, hit".

In some embodiments of the present application, the step "processing sample data based on a category keyword, and determining a positive feature word and a negative feature word of a preset category" may include:

based on the category keywords, dividing the sample data to obtain positive sample data and negative sample data; feature word mining is respectively carried out on the positive sample data and the negative sample data based on a preset threshold and category keywords, and positive feature words and negative feature words of a preset category are determined.

Specifically, sample data related to a preset category may be divided, and if the sample data belongs to the preset category, the sample data is positive sample data, and if the sample data does not belong to the preset category, the sample data is negative sample data. The division can be performed through manual division (manual labeling) or automatic division through an algorithm or a trained neural network model.

After the sample data is divided, the positive feature words of the preset category may be determined based on the positive sample data, and the negative feature words of the preset category may be determined based on the negative sample data. Specifically, feature word mining may be performed on the positive/negative sample data; for example, the positive feature words and the negative feature words may be mined through the PrefixSpan (Prefix-Projected Pattern Growth) algorithm.

For example, the positive/negative sample data may first be preprocessed. The preprocessing may filter out irrelevant information such as punctuation marks and numbers in the positive/negative sample data (e.g., by regular-expression filtering), and then filter out the category keywords existing in the positive/negative sample data to obtain the filtered positive/negative sample data. Word segmentation is then performed on the filtered positive/negative sample data to obtain a plurality of sample words; the word segmentation may be performed by a word segmentation tool (e.g., jieba).
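The preprocessing described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function name `preprocess` is hypothetical, keyword removal is done by plain substring replacement, and whitespace splitting stands in for a real word segmentation tool (for Chinese text a segmenter such as `jieba.lcut` could be passed as `segment` instead).

```python
import re


def preprocess(sample, category_keywords, segment=str.split):
    """Filter a sample text and segment it into sample words."""
    # 1. Regular-expression filtering: drop punctuation, digits and underscores.
    cleaned = re.sub(r"[^\w\s]|\d|_", " ", sample)
    # 2. Remove category keywords (longest first, so that a keyword contained
    #    in a longer keyword does not leave fragments behind).
    for keyword in sorted(category_keywords, key=len, reverse=True):
        cleaned = cleaned.replace(keyword, " ")
    # 3. Segment the filtered text; whitespace splitting is an assumption here.
    return segment(cleaned)
```

Calling `preprocess("Buy replica shoes, contact 1234!", {"replica"})` would leave the sample words "Buy", "shoes" and "contact".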

Then, feature word mining can be performed on the sample data, and the mining process can include:

1. finding the word sequence prefixes of unit length 1 and their corresponding projection data sets;

2. counting the occurrence frequency of each word sequence prefix, and adding the word sequence prefixes whose occurrence frequency is higher than the minimum support degree threshold to the data set, to obtain the frequent word sequences of length i = 1;

3. performing recursive mining on every word sequence prefix that has length i and meets the minimum support degree requirement:

1) finding the projection data set of the word sequence prefix, and returning from the recursion if the projection data set is empty;

2) counting the support degree of each single item in the corresponding projection data set, and combining each single item that meets the minimum support degree with the current prefix to obtain a new prefix; if no single item meets the minimum support degree requirement, returning from the recursion;

3) letting i = i + 1, and recursively performing step 3 on each of the new prefixes obtained by the combination;

4. returning all frequent word sequences in the word sequence data set.

The frequent word sequence is a feature word, and feature word mining is respectively carried out on positive sample data and negative sample data through the mode, so that a positive feature word and a negative feature word can be obtained.
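The mining steps above can be sketched as a compact recursive PrefixSpan over sequences of single words. This is a simplified illustration under stated assumptions: each sample is a list of words, the support of a pattern is the number of samples that contain it as a subsequence, and a pattern is kept when its support is at least `min_sup` (the text speaks of a frequency "higher than" the threshold, so the exact comparison is an assumption). The function name `prefixspan` is hypothetical.

```python
from collections import Counter


def prefixspan(sequences, min_sup):
    """Return {pattern (tuple of words): support} for all frequent word sequences."""
    results = {}

    def mine(prefix, projected):
        # Count each word at most once per projected sequence.
        counts = Counter()
        for seq in projected:
            counts.update(set(seq))
        for word, support in counts.items():
            if support < min_sup:
                continue  # this single item does not meet the minimum support
            new_prefix = prefix + (word,)
            results[new_prefix] = support
            # Projection: keep what follows the first occurrence of `word`.
            new_projected = []
            for seq in projected:
                if word in seq:
                    suffix = seq[seq.index(word) + 1:]
                    if suffix:
                        new_projected.append(suffix)
            if new_projected:  # an empty projection ends this branch
                mine(new_prefix, new_projected)

    mine((), [list(seq) for seq in sequences])
    return results
```

Running the sketch once on the segmented positive samples and once on the segmented negative samples would give candidate positive and negative feature words respectively.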

The minimum support degree can be determined based on the minimum support rate, the minimum support rate can be flexibly adjusted according to factors such as preset types and number of sample data in the practical process, and the calculation method of the minimum support degree can be as follows:

min_sup = a × n

wherein min_sup is the minimum support degree, a is the minimum support rate, and n is the number of sample data.
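The mining steps and the support threshold described above can be sketched as follows. This is only a minimal illustration of prefix-projection mining, not the applicant's implementation: the function name `mine_frequent_sequences`, the sample data, and the value a = 0.5 are assumptions, and "meeting" the minimum support degree is taken here to mean a support greater than or equal to min_sup.

```python
from collections import Counter

def mine_frequent_sequences(sequences, min_sup):
    """Return {frequent word sequence (tuple): support} using prefix projection."""
    frequent = {}

    def mine(prefix, projected):
        # Support of an item = number of projected suffixes that contain it.
        counts = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, support in sorted(counts.items()):
            if support < min_sup:  # assumption: "meets" means support >= min_sup
                continue
            new_prefix = prefix + (item,)
            frequent[new_prefix] = support
            # Projection data set: what follows the first occurrence of `item`.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            new_projected = [s for s in new_projected if s]
            if new_projected:  # an empty projection ends this branch
                mine(new_prefix, new_projected)

    mine((), [list(s) for s in sequences])
    return frequent

# Hypothetical, already segmented positive samples (category keywords removed).
positive_samples = [
    ["original factory", "real shooting", "details"],
    ["original factory", "consultation"],
    ["real shooting", "details", "consultation"],
    ["original factory", "consultation"],
]
a = 0.5                       # minimum support rate (illustrative value)
n = len(positive_samples)     # number of sample data
min_sup = a * n               # min_sup = a × n
frequent_sequences = mine_frequent_sequences(positive_samples, min_sup)
```

Running the same function on the negative sample data would yield candidate negative feature words in the same way.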

102. And performing word segmentation on the text to be classified to obtain a plurality of text words and word sequence information of the text words.

The text words may include the words constituting the text to be classified, and the word order information of the text words may be the order information of the text words in the text to be classified.

A sentence is composed of a plurality of words, and the same words appearing in different orders can form different sentences that convey different meanings. For computer equipment, dividing a sentence into the correct words is word segmentation; because the order of the words in a sentence plays an important role, the word segmentation result further includes the word order information of the text words.

In practice, word segmentation can be performed by a word segmentation tool. Common word segmentation tools may be dictionary-based, machine-learning-based, and the like, and may include jieba word segmentation, a Chinese word segmenter, the Stanford segmenter, and the like.

In order to obtain a word segmentation result with higher accuracy, the text to be classified may be preprocessed before word segmentation, for example to screen out useless characters.

The method of the application performs classification detection on some of the words in the text to be classified so as to determine the category of the text to be classified; segmenting the text to be classified into a plurality of text words is the basis for this classification detection on some of the words.

For example, to segment the text "one-to-one repeated engraving high imitation brand complete satellite buckle same number 78666", the text is first preprocessed to screen out the numbers therein, giving "one-to-one repeated engraving high imitation brand complete satellite buckle same number"; the text is then segmented to obtain the text words and the word order information of the text words: one-to-one (1), repeated engraving (2), high imitation (3), brand (4), complete (5), satellite (6), buckle (7) and same number (8).
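The preprocessing and segmentation described in this step can be sketched as below. The helper names `preprocess` and `segment_with_order` are hypothetical, and a whitespace split stands in for a real word segmentation tool; for that reason the multi-word tokens of the example are joined with underscores here. In practice a segmenter function (for instance the one provided by jieba) could be passed as `tokenizer`.

```python
import re

def preprocess(text):
    """Remove digits and punctuation, then collapse repeated whitespace."""
    text = re.sub(r"\d+", " ", text)        # screen out numbers
    text = re.sub(r"[^\w\s]", " ", text)    # screen out punctuation marks
    return re.sub(r"\s+", " ", text).strip()

def segment_with_order(text, tokenizer=str.split):
    """Return [(text word, 1-based position)] for the preprocessed text."""
    words = tokenizer(preprocess(text))
    return [(word, position) for position, word in enumerate(words, start=1)]

raw_text = ("one_to_one repeated_engraving high_imitation brand complete "
            "satellite buckle same_number 78666!!")
text_words = segment_with_order(raw_text)
```

Keeping the position next to each word is one simple way to carry the word order information that the later steps rely on.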

103. And determining target category keywords existing in the text to be classified according to the text words and the category keywords.

The target category keywords may be the category keywords existing in the text to be classified. Specifically, the text words of the text to be classified may be searched for text words identical to the category keywords; if such text words exist, they are the target category keywords.

For example, the category keywords may include "high imitation, repeated engraving, goods inspection, original factory, tail bill, WeChat, satellite, and V letter". Each category keyword is searched for among the text words of the text to be classified, and it can finally be determined that the target category keywords of the text to be classified include "repeated engraving", "high imitation", and "satellite".
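A minimal sketch of this matching step, assuming the text words are held as (word, position) pairs; the function name `find_target_keywords` is hypothetical and the data simply restates the example above.

```python
def find_target_keywords(text_words, category_keywords):
    """Return the (word, position) pairs whose word is a category keyword."""
    keyword_set = set(category_keywords)
    return [(word, pos) for word, pos in text_words if word in keyword_set]

category_keywords = ["high imitation", "repeated engraving", "goods inspection",
                     "original factory", "tail bill", "WeChat", "satellite", "V letter"]
text_words = [("one-to-one", 1), ("repeated engraving", 2), ("high imitation", 3),
              ("brand", 4), ("complete", 5), ("satellite", 6),
              ("buckle", 7), ("same number", 8)]
target_keywords = find_target_keywords(text_words, category_keywords)
```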

104. And determining the target text words associated with the target category keywords from the text to be classified based on the word sequence information of the text words and the target category keywords.

The target text words may include text words in the text to be classified whose positions are associated with the target category keywords. Specifically, the position association may mean that the text words are adjacent to the target category keyword in its context; the number of adjacent text words on each side may be flexibly determined according to actual requirements, and may for example be set to 2 or 3.

For example, for the target category keyword "satellite" in "one-to-one repeated engraving high imitation brand complete satellite buckle same number 78666", the corresponding target text words may be "brand", "complete", "buckle" and "same number".
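The selection of adjacent text words can be sketched as a fixed-size window around the keyword position. The function name `context_words` and the symmetric window are assumptions made for illustration.

```python
def context_words(text_words, keyword_position, window=2):
    """Return the words within `window` positions of the keyword, excluding it."""
    return [word for word, pos in text_words
            if pos != keyword_position and abs(pos - keyword_position) <= window]

text_words = [("one-to-one", 1), ("repeated engraving", 2), ("high imitation", 3),
              ("brand", 4), ("complete", 5), ("satellite", 6),
              ("buckle", 7), ("same number", 8)]
# "satellite" is at position 6; take two words on each side.
target_text_words = context_words(text_words, keyword_position=6, window=2)
```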

105. And classifying the target text words corresponding to the target category keywords on the preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of the target category keywords.

The target text words corresponding to a target category keyword are classified in order to determine the classification result of that target category keyword. Both the classification result of a target category keyword and that of a target text word may take one of two values: belonging to the preset category (positive category) or not belonging to the preset category (negative category).

For example, the positive characteristic words of the advertisement category may include "original factory, real shooting, consultation, detail, Pu Tian, poison, goods identification, tiger pounding", and the negative characteristic words of the advertisement category may include "resisting, trading, illegal, rampant, striking, industrial, commercial, serious, infringement". Based on these, the target text words "brand", "complete", "buckle" and "same number" corresponding to the target category keyword "satellite" in the text to be classified "one-to-one repeated engraving high imitation brand complete satellite buckle same number 78666" are classified, and the classification result of the target category keyword "satellite" can thereby be determined to be the positive category.

In some embodiments of the present application, the step "classifying target text words corresponding to the target category keywords on a preset category based on the positive direction feature words and the negative direction feature words to obtain a classification result of the target category keywords" may include:

classifying each target text word corresponding to the target category keywords on a preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of each target text word; and integrating the classification result of each target text word corresponding to the target category keywords to obtain the classification result of the target category keywords.

The target text words of one target category keyword can be multiple, so that after the target text words corresponding to the target category keyword are classified, multiple classification results can be obtained, and the classification results of the target category keyword can be obtained after the multiple classification results are integrated. The integration may be performed by performing weighted calculation on the classification results, and the weight of each classification result may be flexibly set, for example, based on word order information of the target text word relative to the target category keyword.

For example, the target text words "brand", "complete", "buckle" and "same number" are classified to obtain the classification results "positive category", "positive category", "positive category" and "negative category" in turn, and these classification results may then be integrated to obtain the classification result "positive category" of the target category keyword "satellite" corresponding to the target text words.
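The weighted integration mentioned above can be illustrated with a small sketch. The weighting scheme is an assumption, since the text only says that the weights may depend on word order information: here each target text word is weighted by the inverse of its distance from the target category keyword, and a total score of zero is treated as negative. The function name `weighted_vote` is hypothetical.

```python
def weighted_vote(word_results, keyword_position):
    """word_results: [(position, 'positive' | 'negative')] for the target text words."""
    score = 0.0
    for position, label in word_results:
        # Assumed weight: closer words count more (1 / distance to the keyword).
        weight = 1.0 / abs(position - keyword_position)
        score += weight if label == "positive" else -weight
    return "positive" if score > 0 else "negative"

# "satellite" at position 6; brand (4), complete (5), buckle (7), same number (8).
word_results = [(4, "positive"), (5, "positive"), (7, "positive"), (8, "negative")]
keyword_result = weighted_vote(word_results, keyword_position=6)
```

With these assumed weights the score is 0.5 + 1 + 1 - 0.5 = 2, so the keyword is assigned the positive category.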

In some embodiments of the present application, the step "classifying each target text word corresponding to the target category keywords on a preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of each target text word" may include:

respectively counting the occurrence frequency of the positive characteristic words and the negative characteristic words in all the text words to obtain the positive word frequency and the negative word frequency of the target category keywords in the preset category; and performing classification calculation on each target text word corresponding to the target category keywords based on the positive word frequency and the negative word frequency to obtain a classification result of each target text word.

When classifying, there may be one or more texts to be classified. Before each target text word is classified, the following can be counted over all the texts to be classified: the occurrence probability of each positive characteristic word and each negative characteristic word, the occurrence probability of each target text word, and the probability of each target text word appearing simultaneously with each positive/negative characteristic word. A calculation result for the target text word can then be obtained by an algorithm, from which the classification result of the target text word is determined.

For example, the classification calculation of the target text words corresponding to the target category keyword "satellite" in the text to be classified can be performed by the sentiment-orientation pointwise mutual information (SO-PMI) algorithm.

After mutual information is calculated between a target text word and all the feature words, the sentiment-orientation mutual information SO_PMI of the target text word, denoted word, can be obtained:

$$\mathrm{SO\_PMI}(word) = \sum_{pw \in P\_set} \mathrm{PMI}(word, pw) - \sum_{nw \in N\_set} \mathrm{PMI}(word, nw)$$

wherein P_set is the set containing the positive feature words of the preset category, pw is a positive feature word, N_set is the set containing the negative feature words of the preset category, and nw is a negative feature word.

The calculation formula of the mutual information PMI may be:

$$\mathrm{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

wherein w₁ is a target text word, w₂ is a feature word, P(w₁) and P(w₂) are the probabilities of w₁ and w₂ occurring, and P(w₁, w₂) is the probability of w₁ and w₂ occurring at the same time.

Specifically, the target text words of "satellite" may include "brand", "complete", "buckle" and "same number", the positive feature words may be "original factory" and "consultation", and the negative feature words may be "resisting" and "striking". For example, before the classification result of the target text word "brand" is determined, P(original factory), P(consultation), P(resisting), P(striking), P(brand), as well as P(brand, original factory) (i.e., the probability of "brand" and "original factory" appearing at the same time), P(brand, consultation), P(brand, resisting) and P(brand, striking), may be determined over all the texts to be classified. Based on these probability values, the calculation result of the target text word "brand" is obtained and its classification result is then determined. The calculation formula may be as follows:

$$\mathrm{SO\_PMI}(\text{brand}) = \mathrm{PMI}(\text{brand}, \text{original factory}) + \mathrm{PMI}(\text{brand}, \text{consultation}) - \mathrm{PMI}(\text{brand}, \text{resisting}) - \mathrm{PMI}(\text{brand}, \text{striking})$$

The calculation formula of PMI(brand, original factory) may be as follows:

$$\mathrm{PMI}(\text{brand}, \text{original factory}) = \log \frac{P(\text{brand}, \text{original factory})}{P(\text{brand})\,P(\text{original factory})}$$

When the result of SO_PMI(brand) is greater than 0, the classification result of the target text word "brand" may be determined as the positive category, and when the result of SO_PMI(brand) is less than 0, the classification result of the target text word "brand" may be determined as the negative category.

After the classification calculation is completed in turn for the target text words "brand", "complete", "buckle" and "same number", the classification results "positive category", "positive category", "positive category" and "negative category" can be obtained.
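The SO-PMI calculation described above can be sketched as follows. Several points are assumptions made for illustration: probabilities are estimated as the fraction of texts containing a word, a base-2 logarithm is used (the base does not affect the sign of the result), and a pair that never co-occurs contributes 0. The function names and the six toy texts are hypothetical.

```python
import math

def pmi(w1, w2, documents):
    """PMI estimated from document-level occurrence; 0.0 if never co-occurring."""
    n = len(documents)
    p1 = sum(w1 in d for d in documents) / n
    p2 = sum(w2 in d for d in documents) / n
    p12 = sum(w1 in d and w2 in d for d in documents) / n
    if p12 == 0:  # assumption: treat missing co-occurrence as neutral
        return 0.0
    return math.log2(p12 / (p1 * p2))

def so_pmi(word, positive_words, negative_words, documents):
    """Sum of PMI with positive feature words minus sum with negative ones."""
    positive = sum(pmi(word, pw, documents) for pw in positive_words)
    negative = sum(pmi(word, nw, documents) for nw in negative_words)
    return positive - negative

# Hypothetical texts to be classified, each reduced to its set of text words.
documents = [
    {"brand", "original factory", "consultation"},
    {"brand", "original factory"},
    {"brand", "consultation"},
    {"resisting", "striking", "infringement"},
    {"resisting", "illegal"},
    {"striking", "same number"},
]
P_set = ["original factory", "consultation"]
N_set = ["resisting", "striking"]

score_brand = so_pmi("brand", P_set, N_set, documents)
score_same_number = so_pmi("same number", P_set, N_set, documents)
label_brand = "positive" if score_brand > 0 else "negative"
```

On this toy data "brand" scores above 0 and "same number" scores below 0, which matches the positive and negative categories in the example.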

In some embodiments of the present application, the classification result includes a positive category and a negative category, and the step "integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword" may include:

respectively counting the target text words with the classification results of positive categories and negative categories to obtain positive quantity and negative quantity; and determining a classification result of the target category keywords based on the positive quantity and the negative quantity.

For example, the counting result may be that the positive quantity is 3 and the negative quantity is 1, so the classification result of the target category keyword "satellite" may be determined as the positive category based on the counting result.

In some embodiments of the present application, the step "determining a classification result of the target category keyword based on the positive number and the negative number" may include:

when the positive quantity is larger than the negative quantity, determining the classification result of the target category key words as a positive category; and when the positive quantity is smaller than the negative quantity, determining that the classification result of the target category key words is a negative category.

In addition, when the positive quantity is equal to the negative quantity, the classification result of the target category keyword may be a positive category or a negative category, and may be flexibly set according to actual requirements during operation.

For example, if the positive quantity 3 is greater than the negative quantity 1, the classification result of the target category keyword may be determined to be the positive category.
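The counting rule above can be sketched in a few lines. The function name `classify_keyword` and the `tie_result` parameter are hypothetical; the parameter only reflects that the text leaves the handling of equal counts to be configured.

```python
def classify_keyword(word_results, tie_result="positive"):
    """Majority vote over the classification results of the target text words."""
    positive = sum(result == "positive" for result in word_results)
    negative = sum(result == "negative" for result in word_results)
    if positive > negative:
        return "positive"
    if positive < negative:
        return "negative"
    return tie_result  # equal counts: configurable, as noted in the text

satellite_result = classify_keyword(["positive", "positive", "positive", "negative"])
```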

106. And integrating the classification result of each target class keyword to obtain the class of the text to be classified.

The categories of the text to be classified can comprise a positive category and a negative category, the target category keywords of the text to be classified can comprise a plurality of keywords, and the category of the text to be classified can be determined by integrating the classification result of each target category keyword.

For example, the classification result of each target category keyword may be counted, and the most numerous category is the category of the text to be classified.

In some embodiments of the present application, the word stock includes category keywords, positive feature words, and negative feature words corresponding to a plurality of preset categories, and the step "classifying the target text words corresponding to the target category keywords on a preset category based on the positive feature words and the negative feature words to obtain a classification result of the target category keywords" may include:

classifying the target text words corresponding to the target category keywords on each preset category based on the positive feature words and the negative feature words of each preset category to obtain a classification result of the target category keywords on each preset category.

At this time, the step "integrating the classification result of each target category keyword to obtain the category of the text to be classified" may include:

and integrating the classification result of each target class keyword on each preset class, and determining the class of the text to be classified.

In an actual application scenario, text classification may be directed to multiple categories, for example, text categories such as advertisements, illegal words, vulgar words, and the like may be collectively referred to as spam categories, and therefore, when text classification is performed on a text to be classified, it is necessary to determine whether the text belongs to the multiple categories, so as to determine the category (spam category or non-spam category) of the text to be classified.

Therefore, after the target category keywords in the text to be classified are determined, the target text words corresponding to the target category keywords are classified on each preset category based on the positive characteristic words and the negative characteristic words of that preset category, so as to obtain the classification result of the target category keywords on each preset category. The classification results of the target category keywords of the text to be classified on each preset category are then integrated to determine the category of the text to be classified.

In some embodiments of the present application, each preset category includes a positive sub-category and a negative sub-category, and the step "integrating the classification result of each target category keyword on each preset category to determine the category of the text to be classified" may include:

determining the subcategory of the text to be classified in each preset category according to the subcategory of each target category keyword in each preset category; and integrating sub-categories of the text to be classified on all preset categories to obtain the category of the text to be classified.

For each preset category, the classification method ultimately determines whether the text to be classified belongs to that preset category; that is, each preset category includes a positive sub-category (belonging to the preset category) and a negative sub-category (not belonging to the preset category). After the sub-category of the text to be classified on each preset category is determined, the category of the text to be classified can be determined according to the sub-categories on all the preset categories. For example, if the recognition results of the text to be classified on the advertisement category and the vulgar language category are negative sub-categories and the recognition result on the illegal category is a positive sub-category, the text to be classified can be determined to be of the spam category, specifically the illegal category within the spam category. For another example, if the recognition results of the text to be classified on the advertisement category, the vulgar language category and the illegal category are all negative sub-categories, all the sub-categories can be integrated to determine that the text to be classified is of the non-spam category.
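The integration over several preset categories can be sketched as below. The per-category rule is an assumption: a category is treated as a positive sub-category when more of its target category keywords are positive than negative, and equal counts are treated as negative. The function name `classify_text` and the sample results are hypothetical.

```python
def classify_text(keyword_results):
    """keyword_results: {preset category: [result of each target category keyword]}.

    Returns ("spam", [categories with a positive sub-category]) or ("non-spam", []).
    """
    positive_categories = []
    for category, results in keyword_results.items():
        positive = results.count("positive")
        negative = results.count("negative")
        if positive > negative:  # assumed rule: majority of keywords decides
            positive_categories.append(category)
    if positive_categories:
        return "spam", positive_categories
    return "non-spam", []

text_category = classify_text({
    "advertisement": ["negative", "negative"],
    "vulgar language": ["negative"],
    "illegal": ["positive", "positive", "negative"],
})
clean_category = classify_text({
    "advertisement": ["negative"],
    "vulgar language": ["negative"],
    "illegal": ["negative"],
})
```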

Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within a certain time range; it is a massive, high-growth-rate and diversified information asset that yields stronger decision-making power, insight discovery and process optimization capability only under new processing modes. With the advent of the cloud era, big data has attracted more and more attention, and special techniques are needed to process such large amounts of data effectively within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

In an actual application scene, a large amount of data may exist, such as a large amount of texts to be classified, a large amount of feature words, category keywords and the like, and relevant steps in the text classification method can be completed based on a big data correlation technology, so that a classification result with higher accuracy can be obtained.

The embodiment of the application can first obtain the text to be classified and a word bank for text classification, wherein the word bank comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories. The text to be classified is then segmented to obtain a plurality of text words and word sequence information of the text words, and target category keywords existing in the text to be classified are determined according to the text words and the category keywords. Next, based on the word sequence information of the text words and the target category keywords, the target text words associated with the positions of the target category keywords are determined from the text to be classified, and the target text words corresponding to the target category keywords are classified on a preset category based on the positive characteristic words and the negative characteristic words to obtain a classification result of the target category keywords. Finally, the classification results of the target category keywords are integrated to obtain the category of the text to be classified.

The scheme of the application can determine the target category keywords in the text to be classified through the category keywords of the preset categories in the word stock, and determine the target text words associated with the positions of the target category keywords through the word sequence information of the text words obtained after the text to be classified is segmented. The target text words are then classified, the classification result of each target category keyword is determined according to the classification results of its target text words, and the category of the text to be classified is finally determined according to the classification result of each target category keyword. Because the classification result of a target category keyword is determined through the target text words corresponding to it, rather than by the keyword match alone, the accuracy of the classification result of text classification can be improved.

The method described in the above embodiments is further illustrated in detail by way of example.

In this embodiment, spam text classification is taken as an example. Spam text is one type of text and may include a plurality of preset categories, such as an advertisement category, an illegal information category, a vulgar language category and the like. The text classification method of this embodiment can be widely applied to scenarios in which spam text needs to be recognized, such as spam mails, spam barrages (bullet-screen comments), spam short messages and the like. As shown in fig. 3, in the process of playing a video, the barrages may include spam barrages such as "highly imitated sneakers brand is fair + V78666" and "poplar forest carved card hot door color 40 starts WeChat 1111". With the method, the barrage contents can be classified and such spam barrages screened out before the barrages are displayed, which improves the viewing experience of the user; as shown in fig. 4, the remaining barrage contents are normal comment information related to the played content.

In this embodiment, a text classification method is described by taking spam text classification performed by a server as an example, and a flowchart of this embodiment may refer to fig. 5, where:

201. The server receives a text to be classified and loads a word bank for text classification, wherein the word bank comprises category keywords, positive characteristic words and negative characteristic words of a plurality of preset categories.

For example, the preset categories include a pornography category, an advertisement promotion category, and a vulgar language category.

202. The server divides words of the text to be classified to obtain text words corresponding to the text to be classified and word sequence information of the text words.

203. The server determines a target category keyword in the text to be classified, wherein the target category keyword is a text word which is the same as the category keyword in the text to be classified.

For example, the target category keywords of the pornography category in the text to be classified may be determined as word 1 and word 2; the target category keywords of the advertisement promotion category as word 3 and word 4; and the target category keywords of the vulgar language category as word 5 and word 6.

204. And the server determines a target text word associated with the target category keyword in the text to be classified according to the word sequence information of the text word.

For example, the target text word corresponding to word 1 (target category keyword) is: word 7, word 8, word 9, word 10.

205. And the server classifies each target text word of the target category keywords respectively according to the positive characteristic words and the negative characteristic words of each preset category to obtain a classification result of each target text word on each preset category.

The classification result may include a positive category and a negative category, for example, the classification result of the advertisement promotion category may include a positive category (representing that the word or text is the advertisement promotion category) and a negative category (representing that the word or text is the non-advertisement promotion category).

For example, the classification results of the word 7 (target text word) corresponding to the word 1 (target category keyword) in pornography, advertisement promotion, and vulgar categories are positive category, negative category, and negative category, respectively.

206. And integrating the classification result of all target text words of the target category keywords on each preset category by the server to obtain the classification result of the target category keywords on each preset category.

For example, integrating all target text words of word 1 (target category keywords) and determining that the classification results of word 1 in pornography category, advertisement promotion category and vulgar language category are positive category, positive category and negative category respectively.

207. And integrating the classification results of all target category keywords of the text to be classified on each preset category by the server to obtain the classification results of the text to be classified on each preset category.

For example, the classification results of all target category keywords (word 1, word 2, word 3, word 4, word 5, and word 6) of the text to be classified in the pornographic category, the advertisement promotion category, and the vulgar category are integrated, and the classification results of the text to be classified in the pornographic category, the advertisement promotion category, and the vulgar category are determined to be the positive category, the negative category, and the negative category, respectively.

208. And the server determines the category of the text to be classified based on the classification result of the text to be classified on each preset category.

For example, based on the classification results of the text to be classified on the pornography category, the advertisement promotion category and the vulgar language category, the text to be classified can be determined to be pornographic spam text.

Referring to fig. 6, according to the present application, spam reference words (i.e., category keywords) may first be constructed, and the positive and negative samples of a classification training set (i.e., the positive sample data and negative sample data of a preset category) obtained. The positive and negative context feature words of the spam words (i.e., the positive and negative feature words) are then mined by frequent sequence pattern mining. Text classification is then performed on the text to be classified: an N-gram window is used to take the context feature words (target text words) of each matched spam word (target category keyword), the spam classification polarity (i.e., the classification result) of each window word (i.e., target text word) is calculated using SO-PMI, and finally the text classification category (i.e., the category of the text to be classified) is obtained by integrating the classification polarities.

In addition, the embodiment classifies the target text words on a plurality of preset categories (through the positive characteristic words and the negative characteristic words of each preset category), and then can obtain the classification results of the target category keywords on the plurality of preset categories, so that whether the text to be classified belongs to the plurality of preset categories can be judged, the text classification range is wider, and the applicable range of the scheme is wider.

In order to better implement the text classification method provided by the embodiment of the present application, the embodiment of the present application further provides a device based on the text classification method. The meanings of the nouns are the same as those in the text classification method, and specific implementation details can refer to the description in the method embodiment.

As shown in fig. 7, fig. 7 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present application, where the text classification apparatus may include an obtaining module 301, a word segmentation module 302, a first determination module 303, a second determination module 304, a classification module 305, and an integration module 306, where,

the obtaining module 301 is configured to obtain a text to be classified and a lexicon for text classification, where the lexicon includes category keywords, positive feature words, and negative feature words corresponding to preset categories;

the word segmentation module 302 is configured to perform word segmentation on a text to be classified to obtain a plurality of text words and word sequence information of the text words;

the first determining module 303 is configured to determine a target category keyword existing in the text to be classified according to the text word and the category keyword;

a second determining module 304, configured to determine, based on word order information of the text words and the target category keywords, target text words associated with the target category keywords from the text to be classified;

the classification module 305 is configured to classify the target text word corresponding to the target category keyword in a preset category based on the positive characteristic word and the negative characteristic word, so as to obtain a classification result of the target category keyword;

and the integration module 306 is configured to integrate the classification result of each target category keyword to obtain a category of the text to be classified.

In some embodiments of the present application, referring to fig. 8, the classification module 305 includes a classification submodule 3051 and an integration submodule 3052, wherein,

the classification sub-module 3051 is configured to classify, on the basis of the positive characteristic words and the negative characteristic words, each target text word corresponding to the target category keyword in a preset category to obtain a classification result of each target text word;

the integrating submodule 3052 is configured to integrate the classification result of each target text word corresponding to the target category keyword to obtain a classification result of the target category keyword.

the statistical unit is used for respectively counting the occurrence frequency of the positive characteristic words and the negative characteristic words in all the text words to obtain the positive word frequency and the negative word frequency of the target category keywords in the preset category;

the expansion module is used for performing near sense expansion on the category reference words to obtain category keywords of a preset category;

and the processing module is used for processing the sample data based on the category keywords, determining the positive characteristic words and the negative characteristic words of the preset categories and obtaining a word stock, wherein the word stock comprises the category keywords, the positive characteristic words and the negative characteristic words corresponding to the preset categories.

feature word mining is respectively carried out on the positive sample data and the negative sample data based on a preset threshold and category keywords, and positive feature words and negative feature words of a preset category are determined.

In some embodiments of the present application, the lexicon includes category keywords, positive feature words, and negative feature words corresponding to a plurality of preset categories, and the classification module is specifically configured to:

In this embodiment, the obtaining module 301 may first obtain a text to be classified and a word bank for text classification, where the word bank includes category keywords, positive feature words and negative feature words corresponding to preset categories, then the word segmentation module 302 may segment the text to be classified to obtain a plurality of text words and word order information of the text words, the first determining module 303 determines target category keywords existing in the text to be classified according to the text words and the category keywords, then the second determining module 304 determines target text words associated with positions of the target category keywords from the text to be classified based on the word order information of the text words and the target category keywords, the classification module 305 classifies the target text words corresponding to the target category keywords on the preset categories based on the positive feature words and the negative feature words to obtain classification results of the target category keywords, finally, the integration module 306 may integrate the classification result of each target category keyword to obtain the category of the text to be classified.

In addition, an embodiment of the present application further provides a computer device, which may be a terminal or a server. FIG. 9 shows a schematic structural diagram of the computer device according to the embodiment of the present application. Specifically:

The computer device may include components such as a processor 401 with one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the configuration illustrated in FIG. 9 does not limit the computer device, which may include more or fewer components than illustrated, combine some components, or arrange the components differently. Wherein:

The processor 401 is the control center of the computer device. It connects the various parts of the entire computer device using various interfaces and lines, and performs the functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor may alternatively not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the computer device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The computer device further comprises a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 via a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 403 may also include any of one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The computer device may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby implementing the following functions:

acquiring a text to be classified and a word bank for text classification, wherein the word bank comprises category keywords, positive feature words, and negative feature words corresponding to preset categories; performing word segmentation on the text to be classified to obtain a plurality of text words and word order information of the text words; determining target category keywords existing in the text to be classified according to the text words and the category keywords; determining a target text word associated with the target category keyword from the text to be classified based on the word order information of the text words and the target category keyword; classifying the target text words corresponding to the target category keywords on a preset category based on the positive feature words and the negative feature words to obtain a classification result of the target category keywords; and integrating the classification result of each target category keyword to obtain the category of the text to be classified.

For the specific implementation of the above operations, refer to the foregoing embodiments; details are not repeated here.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the computer program.

To this end, embodiments of the present application further provide a storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute the steps in any one of the text classification methods provided in the embodiments of the present application. For example, the computer program may perform the steps of:

acquiring a text to be classified and a word bank for text classification, wherein the word bank comprises category keywords, positive feature words, and negative feature words corresponding to preset categories; performing word segmentation on the text to be classified to obtain a plurality of text words and word order information of the text words; determining target category keywords existing in the text to be classified according to the text words and the category keywords; determining a target text word associated with the position of the target category keyword from the text to be classified based on the word order information of the text words and the target category keyword; classifying the target text words corresponding to the target category keywords on a preset category based on the positive feature words and the negative feature words to obtain a classification result of the target category keywords; and integrating the classification result of each target category keyword to obtain the category of the text to be classified.

The storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

Since the computer program stored in the storage medium can execute the steps of any text classification method provided in the embodiments of the present application, it can achieve the beneficial effects of any such method. These effects are detailed in the foregoing embodiments and are not repeated here.

The text classification method and the text classification device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the embodiments is intended only to help in understanding the method and its core idea. Those skilled in the art may, following the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A method of text classification, comprising:

2. The method according to claim 1, wherein the classifying the target text words corresponding to the target category keywords on the preset category based on the positive feature words and the negative feature words to obtain a classification result of the target category keywords comprises:

classifying each target text word corresponding to the target category keywords on the preset category based on the positive feature words and the negative feature words to obtain a classification result of each target text word;

and integrating the classification result of each target text word corresponding to the target category keywords to obtain the classification result of the target category keywords.

3. The method according to claim 2, wherein the classifying each target text word corresponding to the target category keyword on the preset category based on the positive feature words and the negative feature words to obtain a classification result of each target text word comprises:

respectively counting the occurrence frequency of the positive feature words and the negative feature words in all text words to obtain the positive word frequency and the negative word frequency of the target category keywords in the preset category;

and performing classification calculation on each target text word corresponding to the target category keywords based on the positive word frequency and the negative word frequency to obtain a classification result of each target text word.

4. The method of claim 2, wherein the classification result comprises a positive category and a negative category, and the integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword comprises:

respectively counting the target text words with the classification results of positive categories and negative categories to obtain positive quantity and negative quantity;

and determining a classification result of the target category keywords based on the positive quantity and the negative quantity.

5. The method of claim 4, wherein the determining the classification result of the target category keyword based on the positive quantity and the negative quantity comprises:

6. The method of claim 1, further comprising:

acquiring a category reference word and sample data of a preset category;

performing near-synonym expansion on the category reference words to obtain category keywords of the preset category;

and processing the sample data based on the category keywords, determining positive feature words and negative feature words of the preset category, and obtaining a word bank, wherein the word bank comprises the category keywords, the positive feature words, and the negative feature words corresponding to the preset category.

7. The method of claim 6, wherein the processing the sample data based on the category keyword to determine positive and negative feature words of the preset category comprises:

8. The method according to claim 1, wherein the word bank includes category keywords, positive feature words, and negative feature words corresponding to a plurality of preset categories, and the step of classifying the target text words on the preset categories based on the positive feature words and the negative feature words to obtain the classification result of the target category keywords comprises:

the integrating the classification result of each target category keyword to obtain the category of the text to be classified comprises the following steps:

9. The method according to claim 8, wherein each preset category comprises a positive sub-category and a negative sub-category, and the integrating the classification result of each target category keyword on each preset category to determine the category of the text to be classified comprises:

10. A text classification apparatus, comprising: