CN111078878A - Text processing method, device and equipment and computer readable storage medium - Google Patents


Info

Publication number
CN111078878A
CN111078878A
Authority
CN
China
Prior art keywords
text
classified
classifier
information
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911239505.2A
Other languages
Chinese (zh)
Other versions
CN111078878B (en)
Inventor
石逸轩
戴明洋
潘剑飞
周俊
罗程亮
许金泉
姚远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201911239505.2A priority Critical patent/CN111078878B/en
Publication of CN111078878A publication Critical patent/CN111078878A/en
Application granted granted Critical
Publication of CN111078878B publication Critical patent/CN111078878B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a text processing method, apparatus, device, and computer-readable storage medium, and relates to the field of text processing. The scheme is implemented as follows: acquire data to be classified input by a terminal device, where the data to be classified includes a text to be classified and an identifier of the user who input it; acquire user features corresponding to the user according to the user identifier, and vectorize the text to be classified together with the user features to obtain vector information to be processed; process the vector information to be processed with a preset feature extraction model to obtain feature information corresponding to that vector information; and classify the feature information with a cascade classifier to obtain category information corresponding to the text to be classified. The user features are thereby taken into account during classification, improving the accuracy of text classification.

Description

Text processing method, device and equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a text processing technique.
Background
When analyzing user-generated content, one commonly encountered problem is performing hierarchical topic classification on the text that users produce. In practical applications, this task arises in many business scenarios, such as sub-category classification, question answering, advertisement placement, and search-result organization.
In order to classify content data, a classification tree structure is generally constructed in advance in the prior art, different classification models are respectively constructed for leaf nodes of the tree structure, and each classification model is adopted to classify the content data.
However, user-generated text generally differs greatly from standard natural language: the wording is informal, and the out-of-vocabulary (OOV) problem is severe, so classification depends more heavily on user information. Therefore, when content data is classified by the above method, such content data cannot be classified accurately.
Disclosure of Invention
The present disclosure provides a text processing method, a text processing apparatus, a text processing device, and a computer readable storage medium, which are used to solve the problem that content data cannot be classified accurately when the content data is classified by using the existing text processing method.
In a first aspect, an embodiment of the present disclosure provides a text processing method, including:
acquiring data to be classified input by terminal equipment, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
acquiring user characteristics corresponding to the user according to the user identification, and performing vectorization processing on the text to be classified and the user characteristics to acquire vector information to be processed;
processing the vector information to be processed by adopting a preset feature extraction model to obtain feature information corresponding to the vector information to be processed;
and classifying the characteristic information through a cascade classifier to obtain class information corresponding to the text to be classified.
In the text processing method provided by this embodiment, user features characterizing a user's habitual traits when publishing text information are added to the feature extraction process, so that these user-specific factors are taken into account during classification and the accuracy of text classification is improved.
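The four steps above can be sketched end to end as follows. This is a minimal illustration only: the hash-based embedding, mean-pooling "feature extraction model", and rule-based cascade walk are toy stand-ins invented here, not components of the patent.

```python
# Minimal end-to-end sketch of the four claimed steps, with toy stand-ins
# for the embedding, the feature extraction model, and the cascade
# classifier (all names and rules below are illustrative).

def vectorize(text, user_features):
    # Step 2: map text tokens plus user features to a joint numeric vector.
    tokens = text.split() + user_features
    return [sum(ord(c) for c in t) % 100 / 100.0 for t in tokens]

def extract_features(vector):
    # Step 3: stand-in "feature extraction model" (mean pooling).
    return sum(vector) / len(vector) if vector else 0.0

def cascade_classify(score, taxonomy):
    # Step 4: walk the category tree, refining the label level by level
    # until the current category has no sub-categories.
    labels, node = [], taxonomy
    while node:
        child = sorted(node)[int(score * len(node)) % len(node)]
        labels.append(child)
        node = node[child]
    return "/".join(labels)

TAXONOMY = {"science": {"math": {}, "physics": {}},
            "technology": {"digital": {}, "internet": {}}}

def classify(text, user_features):
    # Step 1 (acquiring the text and user identifier) is assumed done.
    return cascade_classify(extract_features(vectorize(text, user_features)), TAXONOMY)
```

A call such as `classify("new phone released", ["tech_blogger"])` yields a two-level label like `"technology/digital"`, with the exact path depending on the toy scoring.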
In a possible design, after acquiring the data to be classified input by the terminal device, the method further includes:
performing word segmentation, punctuation removal, and encoding on the text to be classified to obtain a preprocessed text to be classified;
correspondingly, the vectorizing processing of the text to be classified and the user features includes:
and vectorizing the preprocessed text to be classified and the user characteristics.
In the text processing method provided by this embodiment, user features characterizing a user's habitual traits when publishing text information are added to the feature extraction process, so that these user-specific factors are taken into account during classification and the accuracy of text classification is improved.
In one possible design, the vectorizing the text to be classified and the user feature includes:
and vectorizing the text to be classified and the user characteristics through Embedding.
In the text processing method provided by this embodiment, the text to be classified and the user features are vectorized via Embedding, so that a base-granularity vector representation of the text to be classified can be obtained accurately.
In one possible design, the cascade classifier includes a plurality of layers of classifiers, and the classifying operation performed on the feature information by the cascade classifier includes:
sequentially inputting the feature information together with the classification result output by the previous-layer classifier into the next-layer classifier, and taking the result output by the final-layer classifier as the category information corresponding to the text to be classified.
According to the text processing method provided by the embodiment, the output result of the classifier at the upper layer and the feature information are input into the classifier at the lower layer together, so that the subclass of the classifier at the lower layer under the classification result can perform the operation of classifying the feature information again, and the classification efficiency and the classification accuracy are effectively improved.
In a possible design, sequentially inputting the feature information and the classification result output by the previous-layer classifier into the next-layer classifier, and taking the result output by the final-layer classifier as the category information corresponding to the text to be classified, includes:
inputting the feature information into a preset first-layer classifier to obtain a first class identifier corresponding to the feature information;
inputting the feature information and the first category identifier into a preset second-layer classifier, where the second-layer classifier performs a classification operation on the feature information within the sub-categories under the first category identifier to obtain a second category identifier corresponding to the feature information, and associating the first category identifier with the second category identifier to obtain a target category identifier;
and determining whether the second category identifier contains further sub-categories, and if so, inputting the target category identifier and the feature information into the next-layer classifier for classification, until the category information output by a classifier contains no further sub-categories.
According to the text processing method provided by the embodiment, the output result of the classifier at the upper layer and the feature information are input into the classifier at the lower layer together, so that the subclass of the classifier at the lower layer under the classification result can perform the operation of classifying the feature information again, and the classification efficiency and the classification accuracy are effectively improved.
In a possible design, after the classifying operation is performed on the feature information through the cascade classifier to obtain the category information corresponding to the text to be classified, the method further includes:
and storing the text to be classified into a storage path corresponding to the category information according to the category information corresponding to the text to be classified.
According to the text processing method provided by the embodiment, the text to be classified is stored into the storage path corresponding to the category information according to the category information corresponding to the text to be classified, so that the text to be classified can be conveniently applied after being classified.
In a second aspect, an embodiment of the present disclosure provides a text processing apparatus, including:
the system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring data to be classified input by terminal equipment, and the data to be classified comprises a text to be classified and an identification of a user inputting the text to be classified;
the vectorization processing module is used for acquiring user characteristics corresponding to the user according to the user identification, and carrying out vectorization processing on the text to be classified and the user characteristics to acquire vector information to be processed;
the characteristic extraction module is used for processing the vector information to be processed by adopting a preset characteristic extraction model to obtain characteristic information corresponding to the vector information to be processed;
and the classification module is used for performing classification operation on the characteristic information through a cascade classifier to obtain the class information corresponding to the text to be classified.
In one possible design, the apparatus further includes:
the preprocessing module is used for performing word segmentation, punctuation removal, and encoding on the text to be classified to obtain a preprocessed text to be classified;
accordingly, the vectorization processing module is configured to:
and vectorizing the preprocessed text to be classified and the user characteristics.
In one possible design, the vectorization processing module is to:
and vectorizing the text to be classified and the user characteristics through Embedding.
In one possible design, the cascade of classifiers includes a multi-layer classifier, and the classification module is configured to:
sequentially input the feature information together with the classification result output by the previous-layer classifier into the next-layer classifier, and take the result output by the final-layer classifier as the category information corresponding to the text to be classified.
In one possible design, the classification module is to:
inputting the feature information into a preset first-layer classifier to obtain a first class identifier corresponding to the feature information;
inputting the feature information and the first category identifier into a preset second-layer classifier, where the second-layer classifier performs a classification operation on the feature information within the sub-categories under the first category identifier to obtain a second category identifier corresponding to the feature information, and associating the first category identifier with the second category identifier to obtain a target category identifier;
and determining whether the second category identifier contains further sub-categories, and if so, inputting the target category identifier and the feature information into the next-layer classifier for classification, until the category information output by a classifier contains no further sub-categories.
In one possible design, the apparatus further includes:
and the processing module is used for storing the texts to be classified into storage paths corresponding to the category information according to the category information corresponding to the texts to be classified.
In a third aspect, an embodiment of the present disclosure provides a text processing apparatus, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, the disclosed embodiments provide a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first aspect.
In a fifth aspect, an embodiment of the present disclosure provides a text processing method, including:
acquiring data to be classified, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
acquiring user characteristics corresponding to the user according to the user identification, and performing vectorization processing on the text to be classified and the user characteristics to acquire vector information to be processed;
processing the vector information to be processed to obtain characteristic information corresponding to the vector information to be processed;
and carrying out classification operation on the characteristic information to obtain class information corresponding to the text to be classified.
In the text processing method, apparatus, device, and computer-readable storage medium provided by this embodiment, user features characterizing a user's habitual traits when publishing text information are added to the feature extraction process, so that these user-specific factors are taken into account during classification and the accuracy of text classification is improved.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a diagram of a system architecture upon which the present disclosure is based;
fig. 2 is a schematic flowchart of a text processing method according to a first embodiment of the disclosure;
FIG. 3 is a category organization structure provided by embodiments of the present disclosure;
fig. 4 is a schematic flowchart of a text processing method according to a second embodiment of the disclosure;
fig. 5 is a schematic structural diagram of a text processing apparatus according to a third embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a text processing apparatus according to a fourth embodiment of the present disclosure;
fig. 7 is a schematic flowchart of a text processing method according to a fifth embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to solve the problem that the existing text processing method cannot accurately classify the content data when the content data is classified, the disclosure provides a text processing method, a text processing device and a computer readable storage medium. Because the existing text processing method does not consider the personalized information of the user, the classification result is inaccurate, and therefore, in order to improve the accuracy of the classification result, the user feature information can be added in the feature extraction process.
It should be noted that the text processing method, device, apparatus, and computer-readable storage medium provided in the present disclosure can be applied to any scene for classifying texts.
Fig. 1 is a system architecture diagram on which the present disclosure is based. As shown in fig. 1, the architecture includes at least a plurality of terminal devices 1 and a text processing apparatus 2. The text processing apparatus 2 may be written in languages such as C/C++, Java, Shell, or Python; the terminal device 1 may be a desktop computer, a tablet computer, or the like. The terminal device 1 is communicatively connected to the text processing apparatus 2, so that the two can exchange information.
Fig. 2 is a schematic flowchart of a text processing method according to a first embodiment of the present disclosure, and as shown in fig. 2, the method includes:
step 101, obtaining data to be classified input by terminal equipment, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified.
This embodiment is executed by a text processing apparatus that is communicatively connected to the terminal device and can therefore exchange information with it. The terminal device obtains the data that needs to be classified. Specifically, a user may publish text content on the terminal device; after receiving the text content, the terminal device may send it to the text processing apparatus in real time for classification. Optionally, the text processing apparatus may instead fetch the text content published by the user from the terminal device at a preset period and classify it then. In either case, the text processing apparatus acquires the data to be classified from the terminal device.
It should be noted that user-generated text generally differs greatly from standard natural language: the wording is informal and depends heavily on user information. Therefore, to improve classification accuracy, the data to be classified may carry the identifier of the user who published the text in addition to the text to be classified itself.
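The payload described in step 101 can be sketched as a small record carrying both pieces of information; the type and field names below are illustrative, not from the patent.

```python
from dataclasses import dataclass

@dataclass
class DataToBeClassified:
    """Payload acquired from the terminal device in step 101 (sketch)."""
    text: str      # the text to be classified
    user_id: str   # identifier of the user who published the text

# Example payload with invented values.
request = DataToBeClassified(text="full-screen phones compared", user_id="user_42")
```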
Step 102, obtaining user characteristics corresponding to the user according to the user identification, and performing vectorization processing on the text to be classified and the user characteristics to obtain vector information to be processed.
In this embodiment, to obtain user features, a database containing features for a large number of users may be established in advance, where each feature characterizes a user's habitual traits when publishing text information. After the user identifier is obtained, the user features corresponding to that user can be fetched from the database by the identifier.
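A minimal stand-in for this pre-built user-feature database is a keyed lookup; the feature names and values below are invented for illustration.

```python
# Toy in-memory stand-in for the pre-built database of user features.
# Each entry characterizes a user's habitual traits when publishing text
# (all keys and values here are illustrative).
USER_FEATURE_DB = {
    "user_42": ["domain:technology", "style:short_posts"],
    "user_77": ["domain:physics", "style:long_form"],
}

def get_user_features(user_id):
    # Unknown users fall back to an empty feature list rather than failing.
    return USER_FEATURE_DB.get(user_id, [])
```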
After the text to be classified and the user features are obtained, feature extraction operation can be carried out on the text to be classified and the user features. Correspondingly, before feature extraction, for convenience of model processing, vectorization processing may be performed on the text to be classified and the user features to obtain the text to be classified and the vector information to be processed corresponding to the user features.
Specifically, on the basis of the foregoing embodiment, the step 102 specifically includes:
and vectorizing the text to be classified and the user characteristics through Embedding.
In this embodiment, Embedding may be used to vectorize the text to be classified and the user features, obtaining a base-granularity vector representation of the text to be classified. The base granularity may be word granularity or character granularity. Word vectors can be obtained by segmenting the text to be classified and vectorizing each token after segmentation, or vectors can be obtained by vectorizing the text directly without segmentation. The present disclosure is not limited in this respect.
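The Embedding step can be sketched as a lookup table that maps each token, text and user feature alike, to a fixed vector. In practice this table would be a trained Embedding layer; here each token simply receives a fixed random vector on first sight, and the dimension is invented for illustration.

```python
import random

# Toy word-granularity embedding table (a trained Embedding layer would
# replace this in a real system).
random.seed(0)
EMBED_DIM = 4
_embedding_table = {}

def embed(token):
    # Assign each token a fixed random vector the first time it is seen.
    if token not in _embedding_table:
        _embedding_table[token] = [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    return _embedding_table[token]

def to_vectors(text_tokens, user_features):
    # Text tokens and user features are embedded into one sequence of
    # vectors: the "vector information to be processed".
    return [embed(t) for t in text_tokens + user_features]
```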
Step 103, processing the vector information to be processed by adopting a preset feature extraction model, and obtaining feature information corresponding to the vector information to be processed.
In this embodiment, after the vector information to be processed corresponding to the text to be classified and the user features is obtained, the feature information of that vector information may be extracted. Specifically, the vector information to be processed may be processed by a preset feature extraction model to obtain the corresponding feature information. It should be noted that any model capable of feature extraction may be used, such as CNN, RNN, LSTM, or Transformer, and the present disclosure is not limited in this respect.
As one implementation, since different network models have different strengths, the traits of the text to be classified can be assessed after the data to be classified is received, and a different network model can be selected for feature extraction accordingly. For example, CNN is good at extracting local textual relationships within adjacent windows; LSTM can capture dependency information in long sentences; the Transformer is well suited to Seq2Seq tasks; and the BERT model, which adopts a bidirectional Transformer structure, has made breakthrough progress on many NLP tasks.
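The model-selection idea above can be sketched as a simple dispatcher. The traits and thresholds below are invented for illustration; a real system would select among trained models rather than return names.

```python
def choose_extractor(tokens, is_seq2seq_task=False):
    # Pick a feature-extraction model family from simple traits of the
    # input (illustrative rules, not from the patent).
    if is_seq2seq_task:
        return "Transformer"  # suited to Seq2Seq-style tasks
    if len(tokens) > 50:
        return "LSTM"         # captures dependencies in long sentences
    return "CNN"              # captures local patterns in adjacent windows
```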
And 104, classifying the feature information through a cascade classifier to obtain class information corresponding to the text to be classified.
Fig. 3 shows a category organization structure provided by an embodiment of the present disclosure. As shown in fig. 3, the classification process involves several category levels, such as the second-level categories under technology (digital, internet, mathematics, physics, etc.), third-level categories (television, mobile phone, etc.), and fourth-level categories (full-screen mobile phone, non-full-screen mobile phone, etc.). Therefore, to classify the feature information accurately, a preset cascade classifier may be used to classify it and obtain the category information corresponding to the text to be classified.
Further, on the basis of any of the above embodiments, after the step 104, the method further includes:
and storing the text to be classified into a storage path corresponding to the category information according to the category information corresponding to the text to be classified.
In this embodiment, after the text to be classified is classified, the classified text may be stored in a storage path corresponding to the category information. Therefore, when the text information corresponding to a certain category is called subsequently, all the text information can be directly acquired from the storage path corresponding to the category.
In the text processing method provided by the embodiment, the user features used for representing the familiar features of the user when the user publishes the text information are added in the feature extraction process, so that the factors of the user features can be considered in the classification process, and the accuracy of text classification is improved.
Further, on the basis of any of the above embodiments, after the step 101, the method further includes:
performing word segmentation, punctuation removal, and encoding on the text to be classified to obtain a preprocessed text to be classified;
correspondingly, step 102 specifically includes:
and vectorizing the preprocessed text to be classified and the user characteristics.
In this embodiment, to improve the classification efficiency for the text to be classified, the text may first be preprocessed before the classification operation. Specifically, the text to be classified may be subjected to word segmentation, punctuation removal, encoding, and the like, to obtain the preprocessed text to be classified. The user features and the preprocessed text can then be vectorized together to obtain the vector information to be processed.
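The preprocessing steps named above can be sketched as follows. A real system would use a proper Chinese word segmenter (for example jieba) instead of whitespace splitting; this sketch normalizes the encoding, strips punctuation, and tokenizes.

```python
import re
import unicodedata

def preprocess(text):
    # 1) Normalize the character encoding (NFKC folds full-width forms).
    text = unicodedata.normalize("NFKC", text)
    # 2) Remove punctuation: replace any non-word, non-space character.
    text = re.sub(r"[^\w\s]", " ", text)
    # 3) Segment into tokens (whitespace split as a stand-in for a
    #    real word segmenter).
    return text.lower().split()
```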
According to the text processing method provided by the embodiment, before the text to be classified is classified, the text to be classified is preprocessed, so that useless characters and the like in the text to be classified can be removed, and the classification efficiency of the text to be classified is improved.
Further, on the basis of any of the above embodiments, the cascade classifier includes a multi-layer classifier, and step 104 specifically includes:
sequentially inputting the feature information together with the classification result output by the previous-layer classifier into the next-layer classifier, and taking the result output by the final-layer classifier as the category information corresponding to the text to be classified.
In this embodiment, since the classification process involves several category levels, a cascade classifier containing multiple layers of classifiers may be used to classify the feature information. Specifically, to improve classification efficiency and accuracy, the classification result output by the previous layer is input into the next-layer classifier together with the feature information, so that the next-layer classifier classifies the feature information among the sub-categories of that result. For example, if the first-level classifier outputs "technology", the technology label and the feature information may be input into the next-level classifier, which then classifies the feature information among the sub-categories under technology (digital, internet, mathematics, physics, etc.). This step is executed for each layer of classifier, and the result output by the final-layer classifier is taken as the category information corresponding to the text to be classified. If the current classifier is the first in the cascade, only the feature information is input into it; if it is the nth classifier, the classification result of the previous classifier is input together with the feature information.
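The label-passing scheme described above can be sketched with two toy layers. The per-layer classifiers here are illustrative lookup rules, not trained models; what matters is that each layer after the first receives the previous layer's label along with the feature information.

```python
def layer1(features):
    # First-level classifier: a toy rule over the extracted features.
    return "technology" if features.get("tech_score", 0) > 0.5 else "science"

def layer2(features, parent_label):
    # Second-level classifier: receives the previous layer's label and
    # classifies only among that label's sub-categories.
    subcategories = {"technology": ["digital", "internet"],
                     "science": ["math", "physics"]}
    options = subcategories[parent_label]
    return options[0] if features.get("concrete", True) else options[1]

def cascade(features, layers):
    # Feed the previous layer's label, together with the feature
    # information, into each subsequent layer; the final layer's output
    # is the category information.
    label = None
    for classifier in layers:
        label = classifier(features) if label is None else classifier(features, label)
    return label
```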
It should be noted that the prior art generally constructs a classification tree in advance, builds a separate classification model for each leaf node, and classifies content data with each such model. Classifying content data this way requires training a model for every sub-category at every level; if the topic tree is deep, it is difficult to train enough models to cover every sub-category, which seriously hurts classification efficiency. The text processing method of this embodiment needs no per-sub-category network model, so it can cover all types of text information while also improving classification efficiency.
According to the text processing method provided by the embodiment, the output result of the classifier at the upper layer and the feature information are input into the classifier at the lower layer together, so that the subclass of the classifier at the lower layer under the classification result can perform the operation of classifying the feature information again, and the classification efficiency and the classification accuracy are effectively improved.
Fig. 4 is a schematic flowchart of a text processing method according to a second embodiment of the present disclosure. On the basis of any of the above embodiments, as shown in fig. 4, sequentially inputting the feature information and the classification result output by the previous-layer classifier into the next-layer classifier, and taking the result output by the final-layer classifier as the category information corresponding to the text to be classified, includes:
step 201, inputting the feature information into a preset first-layer classifier, and obtaining a first category identifier corresponding to the feature information;
step 202, inputting the feature information and the first category identifier into a preset second-layer classifier, where the second-layer classifier is configured to classify the feature information within the sub-categories under the first category identifier to obtain a second category identifier corresponding to the feature information, and associating the first category identifier with the second category identifier to obtain a target category identifier;
step 203, judging whether the second category identifier includes further sub-categories; if so, inputting the target category identifier and the feature information into the next-layer classifier for a further classification operation, until the category information output by a classifier includes no further sub-categories.
In this embodiment, after the feature information is acquired, it may be input into a preset first-layer classifier to obtain a first category identifier corresponding to the feature information. The first category identifier and the feature information are then input into a second-layer classifier, so that the second-layer classifier classifies the feature information among the plurality of sub-categories under the first category identifier and obtains a second category identifier corresponding to the feature information. The first category identifier is associated with the second category identifier to obtain a target category identifier. It is then determined whether the second category identifier includes further sub-categories; if so, subsequent classifiers continue to classify the feature information, and if not, the second category identifier is taken as the category information corresponding to the text to be classified.
For example, in practical applications, suppose the category information output by the first-layer classifier is "technology", which has a plurality of sub-categories. The "technology" label and the feature information are then input into the second-layer classifier together, and the second-layer classifier classifies the feature information among the sub-categories "digital", "internet", "mathematics" and "physics" under the technology category, obtaining the classification result "digital". The feature information continues to be classified among the sub-categories under "digital", obtaining the classification result "mobile phone", and the classification operation continues among the sub-categories under "mobile phone", obtaining the final classification result "full-screen mobile phone". Correspondingly, the plurality of category identifiers are associated to obtain the final category information "technology-digital-mobile phone-full-screen mobile phone".
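Steps 201-203 can be sketched as a loop over a category tree. The tree, the classifier stub, and the identifier-joining convention below are hypothetical illustrations of the embodiment, not its actual implementation:

```python
# Hypothetical category tree: a label maps to its sub-categories; a label
# absent from the tree is a leaf with no further sub-categories (step 203).
CATEGORY_TREE = {
    "technology": ["digital", "internet", "mathematics", "physics"],
    "digital": ["mobile phone", "camera"],
    "mobile phone": ["full-screen mobile phone", "foldable mobile phone"],
}

def classify_in(sub_categories, features):
    # Stand-in for a trained classifier restricted to sub_categories;
    # a real model would score the feature vector against each candidate.
    return sub_categories[0]

def hierarchical_classify(features, root_categories):
    label = classify_in(root_categories, features)  # step 201: first-layer result
    chain = [label]
    # Step 203: keep descending while the current label has sub-categories.
    while label in CATEGORY_TREE:
        label = classify_in(CATEGORY_TREE[label], features)
        chain.append(label)
    # Associate the chain of identifiers into the target category identifier.
    return "-".join(chain)

print(hierarchical_classify([0.2, 0.8], ["technology", "sports"]))
# technology-digital-mobile phone-full-screen mobile phone
```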
According to the text processing method provided by this embodiment, the output result of the upper-layer classifier is input into the lower-layer classifier together with the feature information, so that the lower-layer classifier can classify the feature information again within the sub-categories of that classification result, which effectively improves classification efficiency and classification accuracy.
Fig. 5 is a schematic structural diagram of a text processing apparatus according to a third embodiment of the present disclosure. As shown in fig. 5, the text processing apparatus 30 includes: an acquisition module 31, a vectorization processing module 32, a feature extraction module 33, and a classification module 34. The acquisition module 31 is configured to acquire data to be classified input by a terminal device, where the data to be classified includes a text to be classified and an identifier of a user who inputs the text to be classified; the vectorization processing module 32 is configured to obtain a user feature corresponding to the user according to the user identifier, and perform vectorization processing on the text to be classified and the user feature to obtain vector information to be processed; the feature extraction module 33 is configured to process the vector information to be processed by using a preset feature extraction model to obtain feature information corresponding to the vector information to be processed; and the classification module 34 is configured to perform a classification operation on the feature information through a cascade classifier to obtain the category information corresponding to the text to be classified.
Further, on the basis of the third embodiment, the apparatus further includes:
the preprocessing module is used for performing word segmentation, punctuation removal and coding processing on the text to be classified to obtain a preprocessed text to be classified;
accordingly, the vectorization processing module is configured to:
and vectorizing the preprocessed text to be classified and the user characteristics.
Further, on the basis of any of the above embodiments, the vectorization processing module is configured to:
and vectorizing the text to be classified and the user characteristics through Embedding.
Further, on the basis of any of the above embodiments, the cascade classifier includes a multi-layer classifier, and the classification module is configured to:
and sequentially inputting the characteristic information and the classification result output by the last layer of classifier into the next layer of classifier, and taking the result output by the last layer of classifier as the classification information corresponding to the text to be classified.
Further, on the basis of any of the above embodiments, the classification module is configured to:
inputting the feature information into a preset first-layer classifier to obtain a first category identifier corresponding to the feature information;
inputting the feature information and the first category identifier into a preset second-layer classifier, wherein the second-layer classifier is configured to classify the feature information within the sub-categories under the first category identifier to obtain a second category identifier corresponding to the feature information, and associating the first category identifier with the second category identifier to obtain a target category identifier;
and judging whether the second category identifier includes further sub-categories; if so, inputting the target category identifier and the feature information into a next-layer classifier for a further classification operation, until the category information output by a classifier includes no further sub-categories.
Further, on the basis of any one of the above embodiments, the apparatus further includes:
and the processing module is used for storing the texts to be classified into storage paths corresponding to the category information according to the category information corresponding to the texts to be classified.
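The processing module's behaviour can be sketched with standard-library path handling; the directory layout (one sub-directory per category level) and the file name are illustrative assumptions:

```python
from pathlib import Path
import tempfile

def store_by_category(text, category_info, root):
    # Derive a storage path from the category information: one sub-directory
    # per level of the (hyphen-joined) category identifier chain.
    target_dir = Path(root).joinpath(*category_info.split("-"))
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / "text.txt"
    path.write_text(text, encoding="utf-8")
    return path

root = tempfile.mkdtemp()
saved = store_by_category("full-screen phone review", "technology-digital", root)
print(saved.read_text(encoding="utf-8"))  # full-screen phone review
```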
The present disclosure also provides a text processing apparatus and a readable storage medium according to an embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of a text processing apparatus provided in a fourth embodiment of the present disclosure; as shown in fig. 6, it is a block diagram of the text processing apparatus that performs the text processing method according to the fourth embodiment. The text processing apparatus is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The text processing apparatus may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the text processing apparatus includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the text processing apparatus, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple text processing apparatuses may be connected, with each providing a portion of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium provided by the present disclosure. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the text processing methods provided by the present disclosure. The non-transitory computer-readable storage medium of the present disclosure stores computer instructions for causing a computer to execute the text processing method provided by the present disclosure.
The memory 602, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the text processing method in the embodiments of the present disclosure (e.g., the acquisition module 31, the vectorization processing module 32, the feature extraction module 33, and the classification module 34 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the text processing method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the text processing apparatus, and the like. Further, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, which may be connected to the text processing apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The text processing apparatus of the text processing method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the text processing apparatus, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Fig. 7 is a schematic flowchart of a text processing method provided in a fifth embodiment of the present disclosure, and as shown in fig. 7, the method includes:
step 501, acquiring data to be classified, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
step 502, obtaining user characteristics corresponding to the user according to the user identification, and performing vectorization processing on the text to be classified and the user characteristics to obtain vector information to be processed;
step 503, processing the vector information to be processed to obtain feature information corresponding to the vector information to be processed;
and step 504, performing a classification operation on the feature information to obtain category information corresponding to the text to be classified.
In the text processing method provided by this embodiment, user features representing the habitual characteristics of the user when publishing text information are added in the feature extraction process, so that these user features can be taken into account during classification, which improves the accuracy of text classification.
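Steps 501-504 can be strung together as a toy pipeline in which every stage (user-feature lookup, vectorization, feature extraction, classification) is a stand-in function, not the trained models of the embodiment; the threshold and feature values are arbitrary:

```python
USER_FEATURES = {"user_1": [1.0]}  # hypothetical per-user habitual features

def pipeline(text, user_id):
    user_feat = USER_FEATURES.get(user_id, [0.0])  # step 502: user-feature lookup
    vector = [float(len(text))] + user_feat        # step 502: vectorization (toy)
    features = [v / 10.0 for v in vector]          # step 503: feature extraction (toy)
    # step 504: classification (toy threshold on the first feature)
    return "long-text" if features[0] > 2.0 else "short-text"

print(pipeline("a short note", "user_1"))  # short-text
```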
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A method of text processing, comprising:
acquiring data to be classified input by terminal equipment, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
acquiring user characteristics corresponding to the user according to the user identification, and performing vectorization processing on the text to be classified and the user characteristics to acquire vector information to be processed;
processing the vector information to be processed by adopting a preset feature extraction model to obtain feature information corresponding to the vector information to be processed;
and classifying the feature information through a cascade classifier to obtain category information corresponding to the text to be classified.
2. The method according to claim 1, wherein after acquiring the data to be classified input by the terminal device, the method further comprises:
performing word segmentation, punctuation removal and encoding processing on the text to be classified to obtain a preprocessed text to be classified;
correspondingly, the vectorizing processing of the text to be classified and the user features includes:
and vectorizing the preprocessed text to be classified and the user characteristics.
3. The method according to claim 1, wherein the vectorizing the text to be classified and the user features comprises:
and vectorizing the text to be classified and the user characteristics through Embedding.
4. The method according to any one of claims 1-3, wherein the cascade of classifiers includes a plurality of layers of classifiers, and the classifying operation performed on the feature information by the cascade of classifiers includes:
and sequentially inputting the characteristic information and the classification result output by the last layer of classifier into the next layer of classifier, and taking the result output by the last layer of classifier as the classification information corresponding to the text to be classified.
5. The method according to claim 4, wherein the sequentially inputting the feature information and the classification result output by the previous classifier into the next classifier, and taking the result output by the last classifier as the category information corresponding to the text to be classified comprises:
inputting the feature information into a preset first-layer classifier to obtain a first category identifier corresponding to the feature information;
inputting the feature information and the first category identifier into a preset second-layer classifier, wherein the second-layer classifier is configured to classify the feature information within the sub-categories under the first category identifier to obtain a second category identifier corresponding to the feature information, and associating the first category identifier with the second category identifier to obtain a target category identifier;
and judging whether the second category identifier includes further sub-categories; if so, inputting the target category identifier and the feature information into a next-layer classifier for a further classification operation, until the category information output by a classifier includes no further sub-categories.
6. The method according to any one of claims 1 to 3, wherein after the classifying operation is performed on the feature information through the cascade classifier to obtain the category information corresponding to the text to be classified, the method further comprises:
and storing the text to be classified into a storage path corresponding to the category information according to the category information corresponding to the text to be classified.
7. A text processing apparatus, comprising:
the system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring data to be classified input by terminal equipment, and the data to be classified comprises a text to be classified and an identification of a user inputting the text to be classified;
the vectorization processing module is used for acquiring user characteristics corresponding to the user according to the user identification, and carrying out vectorization processing on the text to be classified and the user characteristics to acquire vector information to be processed;
the characteristic extraction module is used for processing the vector information to be processed by adopting a preset characteristic extraction model to obtain characteristic information corresponding to the vector information to be processed;
and the classification module is configured to perform a classification operation on the feature information through a cascade classifier to obtain the category information corresponding to the text to be classified.
8. The apparatus of claim 7, further comprising:
the preprocessing module is used for performing word segmentation, punctuation removal and coding processing on the text to be classified to obtain a preprocessed text to be classified;
accordingly, the vectorization processing module is configured to:
and vectorizing the preprocessed text to be classified and the user characteristics.
9. The apparatus of claim 7, wherein the vectorization processing module is configured to:
and vectorizing the text to be classified and the user characteristics through Embedding.
10. The apparatus of any one of claims 7-9, wherein the cascade of classifiers comprises a plurality of layers of classifiers, and wherein the classification module is configured to:
and sequentially inputting the characteristic information and the classification result output by the last layer of classifier into the next layer of classifier, and taking the result output by the last layer of classifier as the classification information corresponding to the text to be classified.
11. The apparatus of claim 10, wherein the classification module is configured to:
inputting the feature information into a preset first-layer classifier to obtain a first category identifier corresponding to the feature information;
inputting the feature information and the first category identifier into a preset second-layer classifier, wherein the second-layer classifier is configured to classify the feature information within the sub-categories under the first category identifier to obtain a second category identifier corresponding to the feature information, and associating the first category identifier with the second category identifier to obtain a target category identifier;
and judging whether the second category identifier includes further sub-categories; if so, inputting the target category identifier and the feature information into a next-layer classifier for a further classification operation, until the category information output by a classifier includes no further sub-categories.
12. The apparatus according to any one of claims 7-9, further comprising:
and the processing module is used for storing the texts to be classified into storage paths corresponding to the category information according to the category information corresponding to the texts to be classified.
13. A text processing apparatus characterized by comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A method of text processing, comprising:
acquiring data to be classified, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
acquiring user characteristics corresponding to the user according to the user identification, and performing vectorization processing on the text to be classified and the user characteristics to acquire vector information to be processed;
processing the vector information to be processed to obtain characteristic information corresponding to the vector information to be processed;
and carrying out classification operation on the characteristic information to obtain class information corresponding to the text to be classified.
CN201911239505.2A 2019-12-06 2019-12-06 Text processing method, device, equipment and computer readable storage medium Active CN111078878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911239505.2A CN111078878B (en) 2019-12-06 2019-12-06 Text processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911239505.2A CN111078878B (en) 2019-12-06 2019-12-06 Text processing method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111078878A true CN111078878A (en) 2020-04-28
CN111078878B CN111078878B (en) 2023-07-04

Family

ID=70313132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911239505.2A Active CN111078878B (en) 2019-12-06 2019-12-06 Text processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111078878B (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020107842A1 (en) * 2001-02-07 2002-08-08 International Business Machines Corporation Customer self service system for resource search and selection
CN101853250A * 2009-04-03 2010-10-06 Huawei Technologies Co., Ltd. Method and device for classifying documents
JP2011154586A (en) * 2010-01-28 2011-08-11 Rakuten Inc Apparatus and method for analyzing posting text, and program for posting text analysis apparatus
CN102684997A * 2012-04-13 2012-09-19 IZP (Beijing) Technology Co., Ltd. Classification method, classification device, training method and training device for communication messages
CN103106211A * 2011-11-11 2013-05-15 *** Communication Group Guangdong Co., Ltd. Emotion recognition method and emotion recognition device for customer consultation texts
CN108416616A * 2018-02-05 2018-08-17 Alibaba Group Holding Ltd. Sorting method and device for complaint and report classification
CN109543032A * 2018-10-26 2019-03-29 Ping An Technology (Shenzhen) Co., Ltd. Text classification method, device, computer equipment and storage medium
CN110147445A * 2019-04-09 2019-08-20 Ping An Technology (Shenzhen) Co., Ltd. Intention recognition method, device, equipment and storage medium based on text classification
CN110309306A * 2019-06-19 2019-10-08 Huaiyin Institute of Technology Document modeling and classification method based on a WSD hierarchical memory network
CN110334216A * 2019-07-12 2019-10-15 Fujian Qupu Internet of Things Technology Co., Ltd. Spam text recognition method and system
CN110503054A * 2019-08-27 2019-11-26 Guangdong University of Technology Text image processing method and device


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CAI-ZHI LIU et al.: "Research of Text Classification Based on Improved TF-IDF Algorithm", 2018 IEEE International Conference of Intelligent Robotic and Control Engineering (IRCE) *
FENG Lizhou: "Research on Key Technologies and Applications of Text Classification", China Doctoral Dissertations Full-text Database *
ZHOU Haobo et al.: "A multi-dimensional interpretation of signature-file text analysis: a second preliminary study of instant-messaging signature files in online communication", Journalism Communication (新闻传播), no. 04
WANG Jing: "Sentiment Classification of Review Texts Based on Bidirectional Gated Recurrent Units", China Masters' and Doctoral Theses Full-text Database *

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN111581381A (en) * 2020-04-29 2020-08-25 北京字节跳动网络技术有限公司 Method and device for generating training set of text classification model and electronic equipment
CN111581381B (en) * 2020-04-29 2023-10-10 北京字节跳动网络技术有限公司 Method and device for generating training set of text classification model and electronic equipment
CN112257432A (en) * 2020-11-02 2021-01-22 北京淇瑀信息科技有限公司 Self-adaptive intention identification method and device and electronic equipment
CN112487295A (en) * 2020-12-04 2021-03-12 ***通信集团江苏有限公司 5G package pushing method and device, electronic equipment and computer storage medium
CN112802568A (en) * 2021-02-03 2021-05-14 紫东信息科技(苏州)有限公司 Multi-label stomach disease classification method and device based on medical history text
CN113139033A (en) * 2021-05-13 2021-07-20 平安国际智慧城市科技股份有限公司 Text processing method, device, equipment and storage medium
CN113535951A (en) * 2021-06-21 2021-10-22 深圳大学 Method, device, terminal equipment and storage medium for information classification

Also Published As

Publication number Publication date
CN111078878B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
CN111078878B (en) Text processing method, device, equipment and computer readable storage medium
CN111967268A (en) Method and device for extracting events in text, electronic equipment and storage medium
CN111221984A (en) Multimodal content processing method, device, equipment and storage medium
CN111104514B (en) Training method and device for document tag model
CN111709247A (en) Data set processing method and device, electronic equipment and storage medium
CN110991427B (en) Emotion recognition method and device for video and computer equipment
CN111325020A (en) Event argument extraction method and device and electronic equipment
CN112036509A (en) Method and apparatus for training image recognition models
CN111859951A (en) Language model training method and device, electronic equipment and readable storage medium
CN110674314A (en) Sentence recognition method and device
CN111259671A (en) Semantic description processing method, device and equipment for text entity
CN111967256A (en) Event relation generation method and device, electronic equipment and storage medium
CN111680517A (en) Method, apparatus, device and storage medium for training a model
CN111144108A (en) Emotion tendency analysis model modeling method and device and electronic equipment
CN111858905B (en) Model training method, information identification device, electronic equipment and storage medium
CN111859982A (en) Language model training method and device, electronic equipment and readable storage medium
CN110532487B (en) Label generation method and device
CN111522944A (en) Method, apparatus, device and storage medium for outputting information
CN111582477A (en) Training method and device of neural network model
CN112380847A (en) Interest point processing method and device, electronic equipment and storage medium
CN111611990A (en) Method and device for identifying table in image
CN111241234A (en) Text classification method and device
CN111241810A (en) Punctuation prediction method and device
CN111127191A (en) Risk assessment method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant