CN110597991A - Text classification method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN110597991A
Authority
CN
China
Prior art keywords: text, word, self, texts, long text
Legal status
Granted
Application number
CN201910853548.3A
Other languages
Chinese (zh)
Other versions
CN110597991B (en)
Inventor
缪畅宇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910853548.3A
Publication of CN110597991A
Application granted
Publication of CN110597991B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text classification method, and relates to the technical field of natural language processing. The method comprises the following steps: generating a long text containing at least two texts to be classified; processing the long text through a self-attention submodel to obtain a fused word vector of each word in the long text, wherein the self-attention submodel is used for fusing the association relations among the words into the original word vector of each word; and processing the fused word vector of each word in the long text through an output sub-model to obtain the classification results of the at least two texts to be classified. According to the scheme, in an artificial intelligence scene based on multi-text classification, word association relations are fused across the different texts to be classified, so that in the process of classification through the output sub-model, text classification can be performed by combining the association relations between the texts to be classified, thereby expanding the information basis of text classification and improving the accuracy of multi-text classification.

Description

Text classification method and device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of natural language processing, in particular to a text classification method, a text classification device, computer equipment and a storage medium.
Background
Multi-text classification is an important link in natural language processing and is widely applied in scenarios such as sentiment analysis, question-answer matching, and search engines.
Multi-text classification generally refers to an application that finds a target text from a plurality of texts through a classification model. In the related art, a classification model for classifying multiple texts generally comprises an output layer and multiple sets of parallel encoders. When performing text classification, the multiple sets of encoders encode the multiple texts in parallel, each set of encoders being responsible for encoding one text to obtain a sentence vector of that text; the output layer then processes the sentence vectors of the multiple texts together and outputs the probabilities (i.e., the classification results) that the multiple texts respectively belong to the target text.
However, in the related art, the plurality of texts are encoded in parallel by the multiple sets of encoders, and the sentence vector of each text only represents the features of that text itself, so the information carried by the sentence vector is relatively limited, which affects the accuracy of text classification.
Disclosure of Invention
The embodiment of the application provides a text classification method, a text classification device, computer equipment and a storage medium, which can improve the accuracy of text classification, and the technical scheme is as follows:
in one aspect, a text classification method is provided, and the method includes:
acquiring at least two texts to be classified, wherein each text to be classified comprises at least one word;
generating a long text containing the at least two texts to be classified;
processing the long text through a self-attention submodel in a classification model to obtain a fused word vector of each word in the long text, wherein the self-attention submodel is used for fusing the association relations among the words into the original word vector of each word;
processing the fused word vector of each word in the long text through an output sub-model in the classification model to obtain the classification results of the at least two texts to be classified; the classification result is used for indicating a target text in the at least two texts to be classified;
the classification model is a model obtained by training through a training data set, the training data set comprises at least two pieces of training data, each piece of training data comprises a long text sample composed of at least one positive sample text and at least one negative sample text, and a labeling result of the long text sample.
In another aspect, an apparatus for classifying text is provided, the apparatus including:
the text acquisition module is used for acquiring at least two texts to be classified, wherein each text to be classified comprises at least one word;
the long text generation module is used for generating a long text containing the at least two texts to be classified;
the first model processing module is used for processing the long text through a self-attention submodel in the classification model to obtain a fused word vector of each word in the long text, and the self-attention submodel is used for fusing the association relations among the words into the original word vector of each word;
the second model processing module is used for processing the fused word vector of each word in the long text through an output sub-model in the classification model to obtain the classification result of the at least two texts to be classified; the classification result is used for indicating a target text in the at least two texts to be classified;
the classification model is a model obtained by training through a training data set, the training data set comprises at least two pieces of training data, each piece of training data comprises a long text sample composed of at least one positive sample text and at least one negative sample text, and a labeling result of the long text sample.
Optionally, the self-attention submodel includes at least two self-attention encoders connected in sequence;
the first model processing module is configured to,
carrying out vector mapping on the long text to obtain an original word vector of each word in the long text;
and inputting the original word vector of each word in the long text into a first self-attention encoder of the at least two self-attention encoders, and obtaining the fused word vector of each word in the long text, which is output by a last self-attention encoder of the at least two self-attention encoders.
Optionally, each self-attention encoder includes a self-attention layer and a forward propagation layer;
the first model processing module is configured to, when an original word vector of each word in the long text is input into a first self-attention encoder of the at least two self-attention encoders, and a fused word vector of each word in the long text output by a last self-attention encoder of the at least two self-attention encoders is obtained,
fusing the input word vectors of the words through a self-attention layer in a target self-attention encoder to obtain fused word vectors of the words; the target self-attention encoder is any one of the at least two self-attention encoders;
performing forward propagation processing on the fused word vector of each word through a forward propagation layer in the target self-attention encoder to obtain the word vector of each word after the forward propagation processing;
and inputting the word vectors of the words after the forward propagation processing into the next layer in the classification model.
Optionally, the output submodel includes a full connection layer and an activation function;
the second model processing module is configured to,
processing the fused word vector of each word in the long text through the full-connection layer;
obtaining respective sentence vectors of the at least two texts to be classified according to the processing result of the full connection layer;
and processing respective sentence vectors of the at least two texts to be classified through the activation function to obtain the classification result.
Optionally, the processing result of the full-connection layer includes a full-connection processing vector of each word in the long text;
the second model processing module is configured to, when obtaining respective sentence vectors of the at least two texts to be classified according to the processing result of the full connection layer,
and dividing the full-connection processing vectors of the words in the long text according to the positions, in the long text, of the words of each of the at least two texts to be classified, so as to obtain respective sentence vectors of the at least two texts to be classified.
Optionally, the long text sample is obtained by splicing at least one positive sample text and at least one negative sample text end to end in a random order.
Optionally, the apparatus further comprises:
a model training module, configured to, before the text acquisition module acquires the at least two texts to be classified,
processing the long text sample through the self-attention submodel to obtain a fused word vector of each word in the long text sample;
processing the fused word vector of each word in the long text sample through the output sub-model to obtain a classification result of the at least one positive sample text and the at least one negative sample text;
and updating parameters in the classification model according to the classification results of the at least one positive sample text and the at least one negative sample text and the labeling results of the long text samples.
Optionally, when parameters in the classification model are updated according to the classification results of the at least one positive sample text and the at least one negative sample text, and the labeling results of the long text samples, the model training module is configured to,
obtaining a cross entropy loss function through the classification result of the at least one positive sample text and the at least one negative sample text and the labeling result of the long text sample;
and updating parameters in the classification model through the cross entropy loss function.
Optionally, the model training module is configured to,
and when the classification model is determined to be not converged according to the cross entropy loss function, updating parameters in the classification model through the cross entropy loss function.
Optionally, the self-attention submodel is a model based on Bidirectional Encoder Representations from Transformers (BERT).
In yet another aspect, a computer device is provided, comprising a processor and a memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the text classification method as described above.
In yet another aspect, a computer-readable storage medium is provided having stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by a processor to implement a text classification method as described above.
The technical scheme provided by the application can comprise the following beneficial effects:
generating a long text containing at least two texts to be classified, processing the long text through a self-attention submodel in a classification model to obtain a fused word vector of each word in the long text, and processing the fused word vector of each word in the long text through an output sub-model in the classification model to obtain a classification result of the at least two texts to be classified; because a single long text contains a plurality of texts to be classified at the same time, the fused word vector of each word in the long text fuses not only the association relations between the current word and the other words in the text to be classified it belongs to, but also the association relations between the current word and the words in the other texts to be classified, thereby fusing the word association relations across different texts to be classified; in the process of classification through the output sub-model, text classification can therefore be performed by combining the association relations between the texts to be classified, which expands the information basis of text classification and improves the accuracy of multi-text classification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a system configuration diagram of an Artificial Intelligence (AI) application system based on natural language processing according to various embodiments of the present application;
FIG. 2 is a diagram illustrating a text classification flow according to an exemplary embodiment;
FIG. 3 is a schematic diagram of a classification model application according to the embodiment shown in FIG. 2;
FIG. 4 is a schematic diagram of model training involved in the embodiment shown in FIG. 2;
FIG. 5 is a flow diagram illustrating a method of text classification in accordance with an exemplary embodiment;
FIG. 6 is a schematic structural diagram of a self-attention encoder according to the embodiment shown in FIG. 5;
FIG. 7 is a flow chart illustrating an application of a classification model according to the embodiment shown in FIG. 5;
FIG. 8 is a flow diagram illustrating a classification model training method in accordance with an exemplary embodiment;
FIG. 9 is a schematic diagram illustrating a training process of the classification model according to the embodiment shown in FIG. 8;
fig. 10 is a block diagram showing a configuration of a text classification apparatus according to an exemplary embodiment;
FIG. 11 is a block diagram illustrating a computer device in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The text classification scheme can extract the correlation characteristics among a plurality of texts through a self-attention mechanism in the multi-text classification process so as to improve the accuracy of multi-text classification. For convenience of understanding, several terms referred to in the embodiments of the present application are explained below.
1) Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
2) Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e., the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
3) Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures, so as to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to give computers intelligence, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
4) Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement on targets, and to further perform graphic processing so that the computer produces images more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
5) Speech Technology: the key technologies of speech technology are Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, among which speech has become one of the most promising human-computer interaction modes.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to technologies such as artificial intelligence natural language processing and machine learning, and is specifically explained by the following embodiment.
Referring to fig. 1, a system configuration diagram of an AI application system based on natural language processing according to various embodiments of the present application is shown. As shown in fig. 1, the system includes a terminal 110 and a server 120.
Terminal 110 may be a terminal device in various AI application scenarios.
For example, the terminal 110 may be a smart home device such as a smart television and a smart television set-top box, or the terminal 110 may be a mobile portable terminal such as a smart phone, a tablet computer and an e-book reader, or the terminal 110 may also be a smart wearable device such as smart glasses and a smart watch.
Among them, an AI application based on natural language processing may be installed in the terminal 110. For example, the AI application may be an intelligent question and answer, intelligent search, or the like.
The server 120 may be a server, or the server 120 may be a server cluster composed of several servers, or the server 120 may include one or more virtualization platforms, or the server 120 may also be a cloud computing service center.
The server 120 may be a server device that provides a background service for the AI application installed in the terminal 110.
Optionally, the system may further comprise a database 130.
The database 130 may be a Redis database, or may be another type of database. The database 130 is used to store various data, such as AI application data, model training data, and user account data.
The terminal 110 may be connected to the server 120 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the system may further include a management device (not shown in fig. 1), which is connected to the server 120 through a communication network. Optionally, the communication network is a wired network or a wireless network.
The AI application system based on natural language processing can perform multi-text classification through the classification model in the process of providing the AI application service, and provide the AI application service according to the multi-text classification result. The classification model may be set in the server 120, and trained and applied by the server 120; alternatively, the classification model may be provided in the terminal 110 and trained and updated by the server 120.
Optionally, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network, including but not limited to a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired or wireless network, a private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
Various data in the AI application system related to the embodiments of the present application may be stored in a blockchain. For example, in one possible implementation, at least one of the terminal 110, the server 120, and the database 130 in the AI application system may be a node in a blockchain system. The various types of data may include, but are not limited to, data stored in the database 130.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptography, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, and an application service layer.
The blockchain underlying platform may comprise processing modules such as user management, basic services, smart contracts and operation monitoring. The user management module is responsible for identity information management of all blockchain participants, including maintenance of public/private key generation (account management), key management, and maintenance of the correspondence between users' real identities and blockchain addresses (authority management), and, when authorized, it supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices and is used to verify the validity of service requests and, after consensus on a valid request is completed, record it to storage; for a new service request, the basic service first performs interface adaptation analysis and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication) after encryption, and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering and contract execution; developers can define contract logic through a programming language, issue it to the blockchain (contract registration), and, according to the logic of the contract terms, invoke keys or trigger execution through other events to complete the contract logic, while the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract setting and cloud adaptation during product release, and for visual output of real-time states during product operation, such as alarms, monitoring of network conditions, and monitoring of node device health status.
The platform product service layer provides the basic capabilities and implementation framework of typical applications; based on these basic capabilities, developers can superpose the characteristics of their business to complete the blockchain implementation of business logic. The application service layer provides blockchain-based application services for business participants to use.
The schemes of the subsequent embodiments of the application are model training and application schemes for text classification. FIG. 2 is a text classification flow diagram shown according to an exemplary embodiment. As shown in fig. 2, the text classification process may be performed by a computer device, such as the server 120 or the terminal 110 in the system shown in fig. 1, or the computer device may be a combination of the terminal 110 and the server 120. When performing text classification, the computer device performs the following steps:
s21, at least two texts to be classified are obtained, and each text to be classified comprises at least one word.
Each piece of text to be classified may be a paragraph composed of one or more sentences.
Optionally, the at least two texts to be classified may be texts selected by the computer device from a text data set according to a preset filtering condition, and the filtering condition may be keyword matching, uniform sampling, or the like.
S22, generating a long text containing the at least two texts to be classified.
In this embodiment of the present application, before classifying at least two texts to be classified, a computer device first generates a single text, where the single text includes text contents of the at least two texts to be classified, and the single text is the long text.
S23, processing the long text through a self-attention submodel in the classification model to obtain a fused word vector of each word in the long text, wherein the self-attention submodel is used for fusing the association relations among the words into the original word vector of each word.
In the embodiment of the present application, the self-attention submodel may be regarded as a single group of encoders, and the computer device performs unified encoding on a long text containing at least two texts to be classified through the self-attention submodel.
The self-attention submodel may be a machine learning model based on a self-attention mechanism, and when a piece of text including a plurality of words is processed, the machine learning model based on the self-attention mechanism may perform fusion according to an association relationship between the plurality of words.
In other words, after a single text containing a plurality of words is processed by a machine learning model based on the self-attention mechanism, the obtained word vector of each word in the single text represents the closeness of the relationship between the current word and other words in the single text in addition to the current word.
In the embodiment of the application, because a single long text contains a plurality of texts to be classified at the same time, the fused word vector of each word in the long text fuses not only the association relations between the current word and the other words in the text to be classified it belongs to, but also the association relations between the current word and the words in the other texts to be classified. That is, in step S23, in addition to fusing the association relations of words within the same text to be classified, the association relations of words between different texts to be classified can also be fused.
S24, processing the fused word vector of each word in the long text through the output sub-model in the classification model to obtain the classification result of the at least two texts to be classified; the classification result is used for indicating a target text in the at least two texts to be classified.
In the embodiment of the application, the output sub-model in the classification model can process the fused word vectors of the words in the long text and output a classification result indicating whether each of the at least two texts to be classified is the target text.
Optionally, the classification result may be probabilities that at least two texts to be classified are the target texts respectively.
The classification model is a model obtained by training a training data set, the training data set comprises at least two pieces of training data, each piece of training data comprises a long text sample composed of at least one positive sample text and at least one negative sample text, and a labeling result of the long text sample.
Please refer to fig. 3, which illustrates a schematic diagram of a classification model application according to an embodiment of the present application. As shown in fig. 3, the classification model 30 includes a self-attention submodel 310 and an output sub-model 320. In the process of applying the model, the computer device obtains n texts to be classified and generates a long text containing the n texts to be classified (step a), then inputs the long text into the self-attention submodel 310 (step b); the self-attention submodel 310 fuses and encodes the original word vectors of the words in the long text and outputs a fused word vector of each word in the long text (step c), wherein the fused word vector of each word fuses the association relations between the current word and each other word in the long text; the output sub-model 320 processes the output of the self-attention submodel 310 and outputs a recognition result 330 (step d), where the recognition result may include probability values corresponding to the n texts to be classified respectively, each probability value being the probability that the corresponding text to be classified belongs to the target text. Further, the computer device may determine the target text among the n texts to be classified according to the recognition result 330.
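A minimal sketch of steps a to d above, with both submodels treated as black-box callables; the function names, the tokenizer, and the tensor shapes are illustrative assumptions rather than the patent's exact interfaces (detailed sketches of each submodel follow in the later embodiments):

```python
import torch

def classify_texts(texts_to_classify, tokenize, self_attention_submodel, output_submodel):
    """Sketch of the application flow in Fig. 3: splice, fuse, then classify."""
    # step a: splice the n texts to be classified head-to-tail into one long text
    long_text_word_ids = torch.cat([tokenize(text) for text in texts_to_classify])
    # steps b and c: the self-attention submodel outputs one fused word vector per word
    fused_word_vectors = self_attention_submodel(long_text_word_ids.unsqueeze(0))
    # step d: the output sub-model turns the fused word vectors into one probability per text
    probabilities = output_submodel(fused_word_vectors)        # shape (1, n)
    target_index = int(probabilities.argmax(dim=-1))           # index of the target text
    return probabilities.squeeze(0).tolist(), target_index
```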
Accordingly, please refer to fig. 4, which shows a schematic diagram of model training according to an embodiment of the present application. As shown in fig. 4, in the training process of the classification model, the positive and negative sample texts are first spliced to form a long text sample, the long text sample is then input into the self-attention submodel for sufficient encoding, the output sub-model then outputs the multi-text classification result, and the parameters are updated according to the labeling result and the multi-text classification result.
To sum up, in the scheme shown in the embodiment of the present application, a long text including at least two texts to be classified is generated, the long text is then processed through a self-attention sub-model in a classification model to obtain a fused word vector of each word in the long text, and the fused word vector of each word in the long text is then processed through an output sub-model in the classification model to obtain a classification result of the at least two texts to be classified; because a single long text contains a plurality of texts to be classified at the same time, the fused word vector of each word in the long text fuses not only the association relations between the current word and the other words in the text to be classified it belongs to, but also the association relations between the current word and the words in the other texts to be classified, thereby fusing the word association relations across different texts to be classified; in the process of classification through the output sub-model, text classification can therefore be performed by combining the association relations between the texts to be classified, which expands the information basis of text classification and improves the accuracy of multi-text classification.
Fig. 5 is a flowchart illustrating a text classification method according to an exemplary embodiment, which may be used in a computer device; for example, the computer device may be the server 120 or the terminal 110 in the system shown in fig. 1, or the computer device may be a combination of the terminal 110 and the server 120. As shown in fig. 5, the text classification method may include the following steps:
step 501, at least two texts to be classified are obtained, wherein each text to be classified comprises at least one word.
Alternatively, the at least two texts to be classified may be texts randomly selected by the computer device from the text data set.
Optionally, the at least two texts to be classified may also be texts randomly selected from the text data set by the computer device according to a preset filtering condition.
Step 502, splicing the at least two texts to be classified end to end to obtain a long text.
In the embodiment of the application, when the computer device performs head-to-tail splicing on at least two texts to be classified, the computer device may perform head-to-tail splicing according to the obtaining sequence of the at least two texts to be classified to obtain the long text.
Optionally, when the computer device performs end-to-end splicing on at least two texts to be classified, the computer device may also perform end-to-end splicing on the at least two texts to be classified according to a random sequence, so as to obtain the long text.
Optionally, when the computer device performs head-to-tail concatenation on at least two texts to be classified, each text to be classified may be supplemented or truncated according to the length of the text to be classified supported by the classification model. For example, assuming that the length of the text to be classified supported by the classification model is m (that is, the text to be classified includes m words), for any text to be classified in at least two texts to be classified, when the length of the text to be classified is smaller than m, the computer device may fill the text to be classified to the length m by using preset words; or, when the length of the text to be classified is greater than m, the computer device may truncate the length of the text to be classified to m through trunk extraction, invalid word filtering, and the like.
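As an illustration only, a minimal sketch of this end-to-end splicing with supplementation/truncation, assuming a fixed per-text length m supported by the classification model and a hypothetical padding word "[PAD]" (the truncation here is a simple cut; the trunk extraction and invalid-word filtering mentioned above are not implemented):

```python
PAD_WORD = "[PAD]"  # assumed padding word, not specified by the patent

def pad_or_truncate(words, m):
    """Supplement a tokenized text to length m, or truncate it when it is longer than m."""
    if len(words) < m:
        return words + [PAD_WORD] * (m - len(words))
    return words[:m]

def build_long_text(texts_to_classify, m):
    """Splice the normalized texts to be classified head-to-tail into one long text (a word list)."""
    long_text = []
    for words in texts_to_classify:
        long_text.extend(pad_or_truncate(words, m))
    return long_text

# example: two texts to be classified, m = 4
print(build_long_text([["what", "is", "bert"], ["bert", "is", "a", "pretrained", "model"]], 4))
# -> ['what', 'is', 'bert', '[PAD]', 'bert', 'is', 'a', 'pretrained']
```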
Step 503, performing vector mapping on the long text to obtain an original word vector of each word in the long text.
In the embodiment of the application, vector mapping can be performed on the long text in an Embedding manner. Embedding is essentially a mapping from a semantic space to a vector space that preserves, as far as possible, the relationships the original samples have in the semantic space; for example, two words with similar semantics are also mapped to positions that are relatively close in the vector space.
In the embodiment of the present application, the original word vector of each word obtained by vector mapping the long text by the computer device may represent the corresponding word itself in the form of data.
Optionally, in the vector mapping process, the computer device may perform vector mapping in combination with the context of each word, and in different contexts, the same word may also be mapped to different original word vectors, for example, the word "apple" may be mapped to different original word vectors in different contexts.
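A minimal sketch of this vector mapping step using learned lookup tables; the vocabulary size, vector dimension and the added position embedding are illustrative assumptions (a purely static lookup does not by itself produce context-dependent vectors; in a BERT-style model the context dependence mainly comes from the later self-attention layers):

```python
import torch
import torch.nn as nn

vocab_size, dim, max_len = 30000, 768, 512   # assumed sizes, for illustration only

token_embedding = nn.Embedding(vocab_size, dim)     # maps each word id to its original word vector
position_embedding = nn.Embedding(max_len, dim)     # encodes where the word sits in the long text

def map_long_text(word_ids):
    """word_ids: LongTensor of shape (seq_len,) holding the ids of the words in the long text."""
    positions = torch.arange(word_ids.size(0))
    return token_embedding(word_ids) + position_embedding(positions)   # (seq_len, dim)
```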
Step 504, inputting the original word vector of each word in the long text into a first self-attention encoder of at least two self-attention encoders included in the self-attention submodel in the classification model, and obtaining a fused word vector of each word in the long text, which is output by a last self-attention encoder of the at least two self-attention encoders.
The self-attention submodel is used for fusing the association relations among the words into the original word vectors of the words, and the self-attention submodel comprises at least two self-attention encoders connected in sequence.
In the embodiment of the present application, in order to improve the fusion accuracy of the word vectors of the words in the long text, a plurality of self-attention encoders may be arranged end to end (directly or indirectly connected) to form the self-attention submodel in the classification model, wherein the output of a self-attention encoder at a previous stage may be directly or indirectly used as the input of the self-attention encoder at the subsequent stage. When performing fusion encoding on the word vectors of the words in the long text, the computer device may input the original word vectors of the words in the long text into the first self-attention encoder, perform fusion encoding sequentially through the self-attention encoders at the different levels, and take the word vectors output by the last self-attention encoder as the fused word vectors of the words in the long text.
Optionally, each self-attention encoder includes a self-attention layer and a forward propagation layer; when acquiring the fused word vector of each word in the long text, the computer equipment can fuse the input word vector of each word through a self-attention layer in a target self-attention encoder to obtain the fused word vector of each word; the target self-attention encoder is any one of the at least two self-attention encoders; performing forward propagation processing on the fused word vector of each word through a forward propagation layer in the target self-attention encoder to obtain the word vector of each word after the forward propagation processing; and inputting the word vector of each word after forward propagation processing into the next layer in the classification model.
Please refer to fig. 6, which illustrates a schematic structural diagram of a self-attention encoder according to an embodiment of the present application. As shown in fig. 6, the self-attention encoder 60 includes at least a self-attention layer 61 and a forward propagation layer 62. The self-attention layer 61 fuses word vectors of words in the long text based on a self-attention mechanism, for example, by performing convolution, scoring, and weighting calculation on the word vectors of the words, so that the fused word vector of each word not only represents the current word, but also carries the closeness of the association between the current word and each other word in the long text. The output result from the attention layer 61 is subjected to certain processing such as residual processing and normalization processing, then input to the forward propagation layer 62, and subjected to processing such as weighting and bias processing by the forward propagation layer 62, and then output to the next layer in the classification model. Optionally, the forward propagation layer may be a feedforward neural network.
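A minimal PyTorch sketch of one such self-attention encoder, with the self-attention layer, residual and normalization processing, and the forward propagation layer arranged as in fig. 6; the dimensions and the GELU activation are assumptions, not values given by the patent:

```python
import torch
import torch.nn as nn

class SelfAttentionEncoder(nn.Module):
    """One self-attention encoder: self-attention layer + forward propagation layer (Fig. 6)."""
    def __init__(self, dim=768, heads=12, hidden=3072):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.forward_propagation = nn.Sequential(      # a feedforward network, as noted above
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                              # x: (batch, seq_len, dim) word vectors
        fused, _ = self.self_attention(x, x, x)        # fuse each word with every other word in the long text
        x = self.norm1(x + fused)                      # residual processing and normalization
        x = self.norm2(x + self.forward_propagation(x))
        return x                                       # word vectors passed to the next layer of the model
```

Stacking N such encoders end to end (for example with nn.Sequential) gives the self-attention submodel described in step 504.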
The above steps only take as an example the case where the self-attention submodel includes at least two self-attention encoders, each of which includes a self-attention layer and a forward propagation layer.
Step 505, processing the fused word vector of each word in the long text through the full-connection layer of the output sub-model in the classification model.
In the embodiment of the present application, the output submodels in the classification model may include, but are not limited to, a fully connected layer and an activation function. Wherein the fully-connected layer may comprise one or more fully-connected layers.
Step 506, obtaining respective sentence vectors of the at least two texts to be classified according to the processing result of the full connection layer.
Optionally, the processing result of the full-connection layer includes a full-connection processing vector of each word in the long text; when obtaining respective sentence vectors of the at least two texts to be classified according to the processing result of the full-connection layer, the computer device may divide the full-connection processing vectors of the words in the long text according to the positions, in the long text, of the words of each of the at least two texts to be classified, so as to obtain respective sentence vectors of the at least two texts to be classified.
Step 507, processing respective sentence vectors of the at least two texts to be classified through an activation function in the classification model to obtain a classification result.
And the classification result is used for indicating a target text in the at least two texts to be classified.
During the above processing, the relative positions of the words in the long text remain unchanged, and correspondingly the relative positions of the words corresponding to each text to be classified in the long text are also fixed. Therefore, after the long text is processed by the full-connection layer, the computer device can determine, from the output result of the full-connection layer and according to the relative positions of the texts to be classified, the word vectors corresponding to the words of each text to be classified, and regard the combination of the word vectors corresponding to a text to be classified as the sentence vector of that text to be classified; the sentence vectors of the texts to be classified are subsequently processed through the activation function to obtain the probability that each text to be classified is the target text.
Alternatively, the activation function may be a softmax function.
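A minimal sketch of this output sub-model; splitting the full-connection processing vectors by each text's position, mean-pooling the words of a text into its sentence vector, and projecting each sentence vector to a scalar score before the softmax are assumed details used here only to make the sketch concrete:

```python
import torch
import torch.nn as nn

class OutputSubmodel(nn.Module):
    """Full-connection layer + position-based split into sentence vectors + softmax activation."""
    def __init__(self, dim=768):
        super().__init__()
        self.full_connection = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)     # assumed: one scalar score per sentence vector

    def forward(self, fused_word_vectors, m, n_texts):
        # fused_word_vectors: (batch, n_texts * m, dim) output by the self-attention submodel
        h = self.full_connection(fused_word_vectors)
        h = h.view(h.size(0), n_texts, m, -1)          # divide by the position of each text's words
        sentence_vectors = h.mean(dim=2)               # (batch, n_texts, dim), one vector per text
        scores = self.score(sentence_vectors).squeeze(-1)
        return torch.softmax(scores, dim=-1)           # probability that each text is the target text
```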
The self-attention submodel in the classification model can be realized based on a BERT model.
For example, please refer to fig. 7, which shows a schematic flow chart of an application of the classification model according to an embodiment of the present application. As shown in fig. 7, the classification model includes a BERT model composed of N self-attention encoders 71, and an output model 72 composed of a fully-connected layer 72a and a softmax function 72b, each self-attention encoder 71 including a self-attention layer 71a and a forward propagation layer 71b. After the at least two texts to be classified (i.e., the texts 1 to n in fig. 7) are spliced end to end into a long text, the long text is mapped to a vector space to obtain an original word vector of each word in the long text, the arrangement order of the original word vectors being the same as the arrangement order of the words in the long text; the original word vectors of the words in the long text are processed in turn by the self-attention layer 71a and the forward propagation layer 71b in each self-attention encoder 71 and then output as the fused word vectors of the words in the long text; the fused word vectors are processed by the fully-connected layer 72a to output the sentence vectors of the texts to be classified (shown as sentence vector 1 to sentence vector n in fig. 7), and the sentence vectors are then processed by the softmax function 72b to output a recognition result 73, where the recognition result 73 includes the probabilities (shown as probability 1 to probability n in fig. 7) that each text to be classified is the target text.
To sum up, in the scheme shown in the embodiment of the present application, a long text including at least two texts to be classified is generated, the long text is then processed through a self-attention sub-model in a classification model to obtain a fused word vector of each word in the long text, and the fused word vector of each word in the long text is then processed through an output sub-model in the classification model to obtain a classification result of the at least two texts to be classified; because a single long text contains a plurality of texts to be classified at the same time, the fused word vector of each word in the long text fuses not only the association relations between the current word and the other words in the text to be classified it belongs to, but also the association relations between the current word and the words in the other texts to be classified, thereby fusing the word association relations across different texts to be classified; in the process of classification through the output sub-model, text classification can therefore be performed by combining the association relations between the texts to be classified, which expands the information basis of text classification and improves the accuracy of multi-text classification.
The classification model related to fig. 2 to 5 may be obtained by training with a pre-labeled training data set, where the training data set may be organized in the form of the long text described above: each piece of training data includes a long text sample and a labeling result for the long text sample, each long text sample includes a plurality of sample texts, the plurality of sample texts include at least one positive sample text and at least one negative sample text, and the labeling result for the long text sample indicates which of the sample texts are positive samples and which are negative samples. The following embodiments of the present application describe the training process of the classification model.
FIG. 8 is a flowchart illustrating a classification model training method according to an example embodiment, which may be used in a computer device, such as the server 120 in the system of FIG. 1 described above. As shown in fig. 8, the classification model training method may include the following steps:
step 801, a training data set is obtained, where the training data set includes at least two pieces of training data, and each piece of training data includes a long text sample composed of at least one positive sample text and at least one negative sample text, and a labeling result of the long text sample.
In this embodiment of the application, when the computer device trains the classification model, a developer may set that each piece of training data includes one positive sample text and a plurality of negative sample texts, and the purpose of the model training is to enable the classification model to recognize the positive sample text in the training data as accurately as possible.
Optionally, the long text sample is obtained by splicing at least one positive sample text and at least one negative sample text end to end in a random order.
Since the classification model focuses on the relative position relationship between the word vectors in the training process, and the target text may appear at any position in the long text in the model application process shown in fig. 2 or fig. 3, in order to avoid that the trained model focuses on the position relationship between the word vectors too much and affects the classification accuracy, the positive sample text and the negative sample text may be combined in a random order to generate training data when organizing the training data.
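A minimal sketch of organizing one piece of training data in this random order, reusing the build_long_text sketch from the earlier embodiment; the one-hot labels layout follows the annotation result described in the later example and is otherwise an assumption:

```python
import random

def build_training_example(positive_text, negative_texts, m):
    """Splice one positive sample text and several negative sample texts end to end in random order."""
    samples = [(positive_text, 1)] + [(text, 0) for text in negative_texts]
    random.shuffle(samples)                                   # random order, so the target position is not fixed
    long_text_sample = build_long_text([words for words, _ in samples], m)
    labeling_result = [label for _, label in samples]         # e.g. [0, 1, 0, ..., 0]
    return long_text_sample, labeling_result
```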
Step 802, processing the long text sample through a self-attention submodel in the classification model to obtain a fused word vector of each word in the long text sample.
Step 803, the fused word vector of each word in the long text sample is processed through the output submodel in the classification model, and the classification result of the at least one positive sample text and the at least one negative sample text is obtained.
The process of fusing and classifying the word vectors in the long text sample in steps 802 and 803 is similar to the process of fusing and classifying the word vectors in the long text in the embodiment shown in fig. 5, and is not described here again.
Step 804, updating the parameters in the classification model according to the classification result of the at least one positive sample text and the at least one negative sample text and the labeling result of the long text sample.
In the embodiment of the application, in the process of training the classification model, the computer device can update the parameters in the classification model through the difference between the classification result output by the classification model and the labeling result, so as to achieve the purpose of training the classification model.
Optionally, the updating of the parameters in the classification model may be updating of matrix parameters such as a weight matrix and a bias matrix in the classification model. The weight matrix and the bias matrix include, but are not limited to, matrix parameters in a self-attention layer, a forward propagation layer and a full connection layer in the classification model.
Optionally, when the parameters in the classification model are updated according to the classification results of the at least one positive sample text and the at least one negative sample text and the labeling result of the long text sample, the computer device may obtain a cross entropy loss function through the classification results of the at least one positive sample text and the at least one negative sample text and the labeling result of the long text sample; and updating parameters in the classification model through the cross entropy loss function.
Cross entropy is an important concept in Shannon's information theory and is mainly used to measure the difference between two probability distributions. In the scheme shown in the embodiment of the application, the cross entropy can be used as a loss function to measure the difference between the classification result of the classification model and the labeling result, and the cross entropy loss function is back-propagated through the classification model to update the parameters in the classification model.
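A minimal sketch of this parameter update, assuming the classification model outputs the per-text probabilities described above, that labels is the one-hot labeling result, and that a standard gradient-based optimizer (an assumption; the patent does not name one) is used:

```python
import torch

def training_step(classification_model, optimizer, long_text_word_ids, labels):
    """One update of step 804: cross entropy between classification result and labeling result."""
    probabilities = classification_model(long_text_word_ids)              # (batch, n_texts)
    # cross entropy between the labeling result and the classification result
    loss = -(labels * torch.log(probabilities + 1e-12)).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()      # back-propagate the cross entropy loss through the classification model
    optimizer.step()     # update weight and bias matrices in the classification model
    return loss.item()
```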
Optionally, the parameters in the classification model may be updated through the cross entropy loss function when it is determined, according to the cross entropy loss function, that the classification model has not converged.
The convergence of the classification model may mean that a difference between an output result of the classification model for the training data and a labeling result of the training data is smaller than a predetermined threshold, or a change rate of the difference between the output result and the labeling result of the training data approaches a certain lower value (for example, approaches 0).
In a possible implementation scheme, when judging whether the classification model converges, whether the classification model converges can be directly judged through the classification result and the labeling result of the training data. For example, after a certain round of iterative training is completed, a cross entropy loss function is calculated through a classification result output by the current round of iterative training and a labeling result, and if the calculated loss function is small or a difference value between the calculated loss function and the cross entropy loss function of the previous round of iterative training approaches to 0, the classification model is considered to be converged.
In another possible implementation scheme, when determining whether the classification model converges, whether the classification model converges may be verified through a verification data set other than the training data, where the verification data is consistent with the training data in organization manner, that is, the verification data also includes a positive sample text and a plurality of negative sample texts, and the verification data also corresponds to the labeled data. For example, after a certain round of iterative training is completed, the verification data except the training data is input into the classification model to obtain the classification result of the verification data; and calculating a cross entropy function according to the classification result output by the classification model and the labeling result of the verification data, and if the calculated loss function is smaller or the difference between the calculated loss function and the cross entropy loss function obtained by processing the verification data after the previous iteration is close to 0, determining that the classification model is converged.
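As a small illustration of the convergence tests described above, a helper that treats the model as converged when the loss (on the training or verification data) or its change between successive iterations approaches zero; the thresholds are arbitrary placeholders, not values from the patent:

```python
def has_converged(previous_loss, current_loss, loss_threshold=1e-3, change_threshold=1e-4):
    """Return True when the loss is small or its change between iterations approaches 0."""
    return current_loss < loss_threshold or abs(previous_loss - current_loss) < change_threshold
```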
The self-attention submodel in the classification model may be implemented based on a BERT model.
For example, please refer to fig. 9, which shows a schematic diagram of a training process of a classification model according to an embodiment of the present application. As shown in fig. 9, the classification model includes a BERT model composed of N self-attention encoders 91, and an output model 92 composed of a fully-connected layer 92a and a softmax function 92b; each self-attention encoder 91 includes a self-attention layer 91a and a forward propagation layer 91 b. The long text sample comprises a positive sample text and a plurality of negative sample texts. The long text sample is mapped to a vector space to obtain the original word vector of each word in the long text sample; after these original word vectors are processed in sequence by the self-attention layer 91a and the forward propagation layer 91b in each self-attention encoder 91, the fused word vector of each word in the long text sample is output. The fused word vectors are processed by the fully-connected layer 92a to output a sentence vector for each sample text, and the sentence vectors are then processed by the softmax function 92b to output a recognition result 93, which comprises the probability that each sample text is the target text. The computer device calculates a cross entropy loss function 95 between the recognition result 93 and the labeling result 94 of the long text sample, and updates the parameters of the classification model according to the cross entropy loss function 95.
In the BERT model shown in fig. 9, each sample text is sufficiently fused with the other sample texts through the Transformer self-attention mechanism, so that the output of each encoder contains the fused information, which is then passed to the forward propagation layer. By repeating this operation N times, the information between the positive and negative samples is fully fused, from the word level at the bottom to the sentence level at the top.
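For illustration, one self-attention encoder 91 (self-attention layer 91a followed by forward propagation layer 91b) could be sketched as a standard Transformer-style block as follows; the dimensions, residual connections and layer normalization are assumptions, since the embodiment does not fix these details.

```python
import torch.nn as nn

class SelfAttentionEncoder(nn.Module):
    def __init__(self, dim=768, num_heads=12, ff_dim=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # self-attention layer 91a
        self.ffn = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))  # forward propagation layer 91b
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, num_words_in_long_text, dim)
        attn_out, _ = self.self_attn(x, x, x)  # every word attends to every word of the long text,
        x = self.norm1(x + attn_out)           # so words from different sample texts are fused
        x = self.norm2(x + self.ffn(x))
        return x

# N such encoders connected in sequence form the self-attention submodel (N = 12 is only an example).
encoders = nn.Sequential(*[SelfAttentionEncoder() for _ in range(12)])
```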
In the output model shown in fig. 9, the word vectors of each sample are projected onto sentence vectors through the fully-connected layer, so that each sample has its own vector representation. Assuming n samples [X1, X2, …, Xn], the resulting sentence vectors form an n × dim matrix [h1, …, hn], where dim is the vector dimension. After the sentence vectors are input into the softmax layer, n probabilities [p1, …, pn] are obtained, representing the probability that each sentence belongs to the positive sample. Assuming that the 2nd sample is the positive sample, the expected target (i.e., the labeling result) is [0, 1, 0, …, 0], and model training is performed by calculating the cross entropy between the classification result and this expected target.
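The following sketch illustrates the softmax step and the cross entropy against the expected target [0, 1, 0, …, 0]. It assumes that the sentence vectors [h1, …, hn] have already been obtained from the fused word vectors, and that a linear scoring head reduces each sentence vector to one score before the softmax, a detail the embodiment does not spell out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, dim = 4, 768
sentence_vectors = torch.randn(n, dim)       # the n x dim matrix [h1, ..., hn]

score_head = nn.Linear(dim, 1)               # assumed per-sentence scoring before the softmax
scores = score_head(sentence_vectors).squeeze(-1)
probs = F.softmax(scores, dim=0)             # [p1, ..., pn], probability of being the positive sample

target = torch.tensor([0., 1., 0., 0.])      # expected target when the 2nd sample is the positive sample
loss = -(target * torch.log(probs)).sum()    # cross entropy against the one-hot expected target
```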
In summary, according to the scheme shown in the embodiments of the application, the classification model is trained through long text samples, each containing a positive sample text and negative sample texts, together with the labeling results of the long text samples. As a result, when the trained classification model processes a long text containing a plurality of texts to be classified, it can fuse the incidence relation between the current word and each word in the other texts to be recognized, thereby realizing the fusion of word incidence relations among different texts to be recognized.
The training and application scheme of the classification model shown in the embodiments of the present application may be applied to any scene involving multi-text classification, as well as to Artificial Intelligence (AI) scenes in which subsequent applications are built on the classified target text. For example, the scheme may be used by an AI to identify the answer to a question from a plurality of candidate texts, and to provide AI services such as intelligent question answering, information retrieval and reading comprehension based on the identified answer.
Fig. 10 is a block diagram illustrating a structure of a text classification apparatus according to an exemplary embodiment. The text classification apparatus may be used in a computer device to perform all or part of the steps in the embodiments shown in fig. 2, 5 or 8. The computer device may be the server 120 or the terminal 110 in the system shown in fig. 1, or the computer device may be a combination of the terminal 110 and the server 120. The text classification apparatus may include:
the text obtaining module 1001 is configured to obtain at least two texts to be classified, where each text to be classified includes at least one word;
a long text generating module 1002, configured to generate a long text including the at least two texts to be classified;
a first model processing module 1003, configured to process the long text through a self-attention submodel in the classification model to obtain a fused word vector of each word in the long text, where the self-attention submodel is used to fuse an association relationship between the words in an original word vector of each word;
the second model processing module 1004 is configured to process the fused word vector of each word in the long text through an output sub-model in the classification model, so as to obtain a classification result of the at least two texts to be classified; the classification result is used for indicating a target text in the at least two texts to be classified;
the classification model is a model obtained by training through a training data set, the training data set comprises at least two pieces of training data, each piece of training data comprises a long text sample composed of at least one positive sample text and at least one negative sample text, and a labeling result of the long text sample.
Optionally, the long text generating module 1002 is configured to splice the at least two texts to be classified end to end to obtain the long text.
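As an illustration only, such end-to-end splicing may be as simple as string concatenation; the separator shown in the second variant is an assumption, not a requirement of this embodiment.

```python
texts_to_classify = ["candidate text A", "candidate text B", "candidate text C"]

long_text = "".join(texts_to_classify)            # plain head-to-tail splicing
# A practical variant keeps a separator so that the word positions of each
# original text remain recoverable inside the long text (an assumption here).
long_text_with_sep = "[SEP]".join(texts_to_classify)
```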
Optionally, the self-attention submodel includes at least two self-attention encoders connected in sequence;
the first model processing module 1003 is configured to,
carrying out vector mapping on the long text to obtain an original word vector of each word in the long text;
and inputting the original word vector of each word in the long text into a first self-attention encoder of the at least two self-attention encoders, and obtaining the fused word vector of each word in the long text, which is output by a last self-attention encoder of the at least two self-attention encoders.
Optionally, each self-attention encoder includes a self-attention layer and a forward propagation layer;
when the original word vector of each word in the long text is input into the first self-attention encoder of the at least two self-attention encoders, and the fused word vector of each word in the long text output by the last self-attention encoder of the at least two self-attention encoders is obtained, the first model processing module 1003 is configured to,
fusing the input word vectors of the words through a self-attention layer in a target self-attention encoder to obtain fused word vectors of the words; the target self-attention encoder is any one of the at least two self-attention encoders;
performing forward propagation processing on the fused word vector of each word through a forward propagation layer in the target self-attention encoder to obtain the word vector of each word after the forward propagation processing;
and inputting the word vectors of the words after the forward propagation processing into the next layer in the classification model.
Optionally, the output submodel includes a full connection layer and an activation function;
the second model processing module 1004 is configured to,
processing the fused word vector of each word in the long text through the full-connection layer;
obtaining respective sentence vectors of the at least two texts to be classified according to the processing result of the full connection layer;
and processing respective sentence vectors of the at least two texts to be classified through the activation function to obtain the classification result.
Optionally, the processing result of the full connection layer includes a full-connection processing vector of each word in the long text;
the second model processing module 1004 is configured to, when obtaining respective sentence vectors of the at least two texts to be classified according to the processing result of the full connection layer,
and dividing the full-connection processing vector of each word in the long text according to the position of each word in the long text in the at least two texts to be classified, so as to obtain respective sentence vectors of the at least two texts to be classified.
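For illustration, the division by word position could be sketched as follows, assuming that the number of words contributed by each text to be classified is known and that the vectors of each group are mean-pooled into that text's sentence vector; the pooling choice and the lengths are assumptions.

```python
import torch

def split_into_sentence_vectors(word_vectors, text_lengths):
    """word_vectors: (num_words_in_long_text, dim); text_lengths: words per text to be classified."""
    sentence_vectors, start = [], 0
    for length in text_lengths:
        group = word_vectors[start:start + length]   # full-connection processing vectors of one text
        sentence_vectors.append(group.mean(dim=0))   # pooled into that text's sentence vector
        start += length
    return torch.stack(sentence_vectors)             # (num_texts, dim)

# e.g. a long text spliced from three texts of 12, 8 and 15 words
vectors = split_into_sentence_vectors(torch.randn(35, 768), [12, 8, 15])
```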
Optionally, the long text sample is obtained by splicing at least one positive sample text and at least one negative sample text end to end in a random order.
Optionally, the apparatus further comprises:
a model training module, configured to, before the text obtaining module 1001 obtains at least two texts to be classified,
processing the long text sample through the self-attention submodel to obtain a fused word vector of each word in the long text sample;
processing the fused word vector of each word in the long text sample through the output sub-model to obtain a classification result of the at least one positive sample text and the at least one negative sample text;
and updating parameters in the classification model according to the classification results of the at least one positive sample text and the at least one negative sample text and the labeling results of the long text samples.
Optionally, when parameters in the classification model are updated according to the classification results of the at least one positive sample text and the at least one negative sample text, and the labeling results of the long text samples, the model training module is configured to,
obtaining a cross entropy loss function through the classification result of the at least one positive sample text and the at least one negative sample text and the labeling result of the long text sample;
and updating parameters in the classification model through the cross entropy loss function.
Optionally, the model training module is configured to,
and when the classification model is determined to be not converged according to the cross entropy loss function, updating parameters in the classification model through the cross entropy loss function.
Optionally, the self-attention submodel is a Bidirectional Encoder Representations from Transformers (BERT) model.
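For illustration only, such a BERT-based self-attention submodel could be instantiated from a publicly available pre-trained encoder, for example via the Hugging Face transformers package; this is one possible implementation choice, not a requirement of this application.

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # pre-trained Chinese BERT (an example choice)
bert = BertModel.from_pretrained("bert-base-chinese")

# The spliced long text is tokenized and encoded; each token receives a fused
# vector that attends to every other token of the long text.
inputs = tokenizer("candidate text A candidate text B", return_tensors="pt")
fused_word_vectors = bert(**inputs).last_hidden_state            # shape: (1, num_tokens, 768)
```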
To sum up, in the scheme shown in the embodiments of the present application, a long text containing at least two texts to be classified is generated; the long text is processed through the self-attention submodel in the classification model to obtain the fused word vector of each word in the long text, and the fused word vectors are then processed through the output submodel in the classification model to obtain the classification results of the at least two texts to be classified. Because a single long text contains a plurality of texts to be recognized at the same time, the fused word vector of each word fuses not only the incidence relation between the current word and the other words in the current text to be recognized, but also the incidence relation between the current word and each word in the other texts to be recognized. The fusion of word incidence relations across different texts to be recognized is thereby realized, so that in the process of classifying through the output submodel, text classification can be performed by combining the incidence relations among the texts to be classified, which expands the information basis of text classification and improves the accuracy of multi-text classification.
FIG. 10 is a block diagram illustrating a computer device according to an example embodiment. The computer apparatus 1000 includes a Central Processing Unit (CPU) 1001, a system Memory 1004 including a Random Access Memory (RAM) 1002 and a Read-Only Memory (ROM) 1003, and a system bus 1005 connecting the system Memory 1004 and the Central Processing Unit 1001. The computer device 1000 also includes a basic Input/Output (I/O) system 1006, which facilitates the transfer of information between devices within the computer, and a mass storage device 1007, which stores an operating system 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse or a keyboard, through which a user inputs information. The display 1008 and the input device 1009 are both connected to the central processing unit 1001 through an input-output controller 1010 connected to the system bus 1005. The basic input/output system 1006 may also include the input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, a mouse, or an electronic stylus. Similarly, the input-output controller 1010 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1007 described above may be collectively referred to as memory.
The computer device 1000 may be connected to the internet or other network devices through a network interface unit 1011 connected to the system bus 1005.
The memory further includes one or more programs, the one or more programs are stored in the memory, and the central processing unit 1001 implements all or part of the steps of the method shown in fig. 2, 5, or 8 by executing the one or more programs.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as a memory comprising computer programs (instructions), executable by a processor of a computer device to perform all or part of the steps of the methods shown in the various embodiments of the present application, is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A method of text classification, the method comprising:
acquiring at least two texts to be classified, wherein each text to be classified comprises at least one word;
generating a long text containing the at least two texts to be classified;
processing the long text through a self-attention submodel in a classification model to obtain a fused word vector of each word in the long text, wherein the self-attention submodel is used for fusing the incidence relation among the words in the original word vector of each word;
processing the fused word vector of each word in the long text through an output sub-model in the classification model to obtain the classification results of the at least two texts to be classified; the classification result is used for indicating a target text in the at least two texts to be classified;
the classification model is a model obtained by training through a training data set, the training data set comprises at least two pieces of training data, each piece of training data comprises a long text sample composed of at least one positive sample text and at least one negative sample text, and a labeling result of the long text sample.
2. The method of claim 1, wherein the generating a long text containing the at least two texts to be classified comprises:
and performing head-to-tail splicing on the at least two texts to be classified to obtain the long text.
3. The method of claim 1, wherein the self-attention submodel comprises at least two self-attention encoders connected in sequence;
the processing the long text through a self-attention submodel in a classification model to obtain a fused word vector of each word in the long text comprises:
carrying out vector mapping on the long text to obtain an original word vector of each word in the long text;
and inputting the original word vector of each word in the long text into a first self-attention encoder of the at least two self-attention encoders, and obtaining the fused word vector of each word in the long text, which is output by a last self-attention encoder of the at least two self-attention encoders.
4. The method of claim 3, wherein each of the self-attention encoders includes a self-attention layer and a forward propagation layer;
the inputting an original word vector of each word in the long text into a first self-attention encoder of the at least two self-attention encoders, and obtaining a fused word vector of each word in the long text, which is output by a last self-attention encoder of the at least two self-attention encoders, includes:
fusing the input word vectors of the words through a self-attention layer in a target self-attention encoder to obtain fused word vectors of the words; the target self-attention encoder is any one of the at least two self-attention encoders;
performing forward propagation processing on the fused word vector of each word through a forward propagation layer in the target self-attention encoder to obtain the word vector of each word after the forward propagation processing;
and inputting the word vectors of the words after the forward propagation processing into the next layer in the classification model.
5. The method of claim 1, wherein the output submodel comprises a fully connected layer and an activation function;
the processing the fused word vector of each word in the long text through the output submodel in the classification model to obtain the classification results of the at least two texts to be classified comprises:
processing the fused word vector of each word in the long text through the full-connection layer;
obtaining respective sentence vectors of the at least two texts to be classified according to the processing result of the full connection layer;
and processing respective sentence vectors of the at least two texts to be classified through the activation function to obtain the classification result.
6. The method of claim 5, wherein the processing result of the fully-connected layer comprises a fully-connected processing vector of each word in the long text;
the obtaining of respective sentence vectors of the at least two texts to be classified according to the processing result of the full connection layer includes:
and dividing the full-connection processing vector of each word in the long text according to the position of each word in the long text in the at least two texts to be classified, so as to obtain respective sentence vectors of the at least two texts to be classified.
7. The method of claim 1, wherein the long text samples are obtained by end-to-end concatenation of at least one positive sample text and at least one negative sample text in a random order.
8. The method according to claim 1, wherein before the obtaining at least two texts to be classified, further comprising:
processing the long text sample through the self-attention submodel to obtain a fused word vector of each word in the long text sample;
processing the fused word vector of each word in the long text sample through the output sub-model to obtain a classification result of the at least one positive sample text and the at least one negative sample text;
and updating parameters in the classification model according to the classification results of the at least one positive sample text and the at least one negative sample text and the labeling results of the long text samples.
9. The method of claim 8, wherein the updating parameters in the classification model according to the classification results of the at least one positive sample text and the at least one negative sample text and the labeling results of the long text samples comprises:
obtaining a cross entropy loss function through the classification result of the at least one positive sample text and the at least one negative sample text and the labeling result of the long text sample;
and updating parameters in the classification model through the cross entropy loss function.
10. The method of claim 9, wherein the updating parameters in the classification model by the cross-entropy loss function comprises:
and when the classification model is determined to be not converged according to the cross entropy loss function, updating parameters in the classification model through the cross entropy loss function.
11. The method according to any one of claims 1 to 10, characterized in that the self-attention submodel is a Bidirectional Encoder Representations from Transformers (BERT) model.
12. An apparatus for classifying text, the apparatus comprising:
the text acquisition module is used for acquiring at least two texts to be classified, wherein each text to be classified comprises at least one word;
the long text generation module is used for generating a long text containing the at least two texts to be classified;
the first model processing module is used for processing the long text through a self-attention submodel in the classification model to obtain a fused word vector of each word in the long text, and the self-attention submodel is used for fusing the incidence relation among the words in the original word vector of each word;
the second model processing module is used for processing the fused word vector of each word in the long text through an output sub-model in the classification model to obtain the classification result of the at least two texts to be classified; the classification result is used for indicating a target text in the at least two texts to be classified;
the classification model is a model obtained by training through a training data set, the training data set comprises at least two pieces of training data, each piece of training data comprises a long text sample composed of at least one positive sample text and at least one negative sample text, and a labeling result of the long text sample.
13. The apparatus of claim 12,
wherein the long text generation module is used for splicing the at least two texts to be classified end to end to obtain the long text.
14. A computer device, characterized in that the computer device comprises a processor and a memory, in which a program is stored which is executed by the processor to implement the text classification method according to any one of claims 1 to 11.
15. A computer-readable storage medium having stored thereon instructions for execution by a processor of a computer device to implement a text classification method according to any one of claims 1 to 11.
CN201910853548.3A 2019-09-10 2019-09-10 Text classification method and device, computer equipment and storage medium Active CN110597991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910853548.3A CN110597991B (en) 2019-09-10 2019-09-10 Text classification method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110597991A true CN110597991A (en) 2019-12-20
CN110597991B CN110597991B (en) 2021-08-17

Family

ID=68858472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910853548.3A Active CN110597991B (en) 2019-09-10 2019-09-10 Text classification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110597991B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013109584A (en) * 2011-11-21 2013-06-06 Nippon Telegr & Teleph Corp <Ntt> Document classification device and method and program
CN103377185A (en) * 2012-04-24 2013-10-30 腾讯科技(深圳)有限公司 Method and device for adding tags to short texts automatically
CN102955856A (en) * 2012-11-09 2013-03-06 北京航空航天大学 Chinese short text classification method based on characteristic extension
CN105843818A (en) * 2015-01-15 2016-08-10 富士通株式会社 Training device, training method, determining device, and recommendation device
CN109657246A (en) * 2018-12-19 2019-04-19 中山大学 A kind of extraction-type machine reading based on deep learning understands the method for building up of model
CN110134789A (en) * 2019-05-17 2019-08-16 电子科技大学 Multi-label long text classification method introducing multi-path selection fusion mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU Ling et al.: "Long text classification method combining attention mechanism", Journal of Computer Applications *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241234A (en) * 2019-12-27 2020-06-05 北京百度网讯科技有限公司 Text classification method and device
CN111241234B (en) * 2019-12-27 2023-07-18 北京百度网讯科技有限公司 Text classification method and device
CN111145914A (en) * 2019-12-30 2020-05-12 四川大学华西医院 Method and device for determining lung cancer clinical disease library text entity
CN111145914B (en) * 2019-12-30 2023-08-04 四川大学华西医院 Method and device for determining text entity of lung cancer clinical disease seed bank
CN112818658A (en) * 2020-01-14 2021-05-18 腾讯科技(深圳)有限公司 Method for training classification model by text, classification method, equipment and storage medium
CN112818658B (en) * 2020-01-14 2023-06-27 腾讯科技(深圳)有限公司 Training method, classifying method, device and storage medium for text classification model
CN111259153B (en) * 2020-01-21 2021-06-22 桂林电子科技大学 Attribute-level emotion analysis method of complete attention mechanism
CN111259153A (en) * 2020-01-21 2020-06-09 桂林电子科技大学 Attribute-level emotion analysis method of complete attention mechanism
CN111104516A (en) * 2020-02-10 2020-05-05 支付宝(杭州)信息技术有限公司 Text classification method and device and electronic equipment
CN111324696A (en) * 2020-02-19 2020-06-23 腾讯科技(深圳)有限公司 Entity extraction method, entity extraction model training method, device and equipment
CN111444709A (en) * 2020-03-09 2020-07-24 腾讯科技(深圳)有限公司 Text classification method, device, storage medium and equipment
CN113469479A (en) * 2020-03-31 2021-10-01 阿里巴巴集团控股有限公司 Contract risk prediction method and device
CN111625645A (en) * 2020-05-14 2020-09-04 北京字节跳动网络技术有限公司 Training method and device of text generation model and electronic equipment
CN111625645B (en) * 2020-05-14 2023-05-23 北京字节跳动网络技术有限公司 Training method and device for text generation model and electronic equipment
CN113919338A (en) * 2020-07-09 2022-01-11 腾讯科技(深圳)有限公司 Method and device for processing text data
CN113919338B (en) * 2020-07-09 2024-05-24 腾讯科技(深圳)有限公司 Method and device for processing text data
CN111930942B (en) * 2020-08-07 2023-08-15 腾讯云计算(长沙)有限责任公司 Text classification method, language model training method, device and equipment
CN111930942A (en) * 2020-08-07 2020-11-13 腾讯云计算(长沙)有限责任公司 Text classification method, language model training method, device and equipment
CN112101030B (en) * 2020-08-24 2024-01-26 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for establishing term mapping model and realizing standard word mapping
CN112101030A (en) * 2020-08-24 2020-12-18 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for establishing term mapping model and realizing standard word mapping
CN112016316A (en) * 2020-08-31 2020-12-01 北京嘀嘀无限科技发展有限公司 Identification method and system
CN112069813A (en) * 2020-09-10 2020-12-11 腾讯科技(深圳)有限公司 Text processing method, device and equipment and computer readable storage medium
CN112069813B (en) * 2020-09-10 2023-10-13 腾讯科技(深圳)有限公司 Text processing method, device, equipment and computer readable storage medium
CN112631139A (en) * 2020-12-14 2021-04-09 山东大学 Intelligent household instruction reasonability real-time detection system and method
CN112631139B (en) * 2020-12-14 2022-04-22 山东大学 Intelligent household instruction reasonability real-time detection system and method
CN112527992A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Long text processing method, related device and readable storage medium
CN112527992B (en) * 2020-12-17 2023-01-17 科大讯飞股份有限公司 Long text processing method, related device and readable storage medium
CN112669928A (en) * 2021-01-06 2021-04-16 腾讯科技(深圳)有限公司 Structured information construction method and device, computer equipment and storage medium
CN113032572B (en) * 2021-04-22 2023-09-05 中国平安人寿保险股份有限公司 Text classification method and device based on text matching model and related equipment
CN113032572A (en) * 2021-04-22 2021-06-25 中国平安人寿保险股份有限公司 Text classification method and device based on text matching model and related equipment
CN112926309A (en) * 2021-05-11 2021-06-08 北京智源人工智能研究院 Safety information distinguishing method and device and electronic equipment
CN116595168A (en) * 2023-04-07 2023-08-15 北京数美时代科技有限公司 BERT model-based rapid classification method, system, medium and equipment

Also Published As

Publication number Publication date
CN110597991B (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN110597991B (en) Text classification method and device, computer equipment and storage medium
CN110852116B (en) Non-autoregressive neural machine translation method, device, computer equipment and medium
CN113094200B (en) Application program fault prediction method and device
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111104512B (en) Game comment processing method and related equipment
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113919344A (en) Text processing method and device
CN111046027A (en) Missing value filling method and device for time series data
CN112131888B (en) Method, device, equipment and storage medium for analyzing semantic emotion
CN117521675A (en) Information processing method, device, equipment and storage medium based on large language model
CN113821668A (en) Data classification identification method, device, equipment and readable storage medium
CN115050064A (en) Face living body detection method, device, equipment and medium
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN114611672A (en) Model training method, face recognition method and device
CN114281931A (en) Text matching method, device, equipment, medium and computer program product
CN113362852A (en) User attribute identification method and device
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN114707589A (en) Method, device, storage medium, equipment and program product for generating countermeasure sample
CN112989024B (en) Method, device and equipment for extracting relation of text content and storage medium
CN113609866A (en) Text marking method, device, equipment and storage medium
CN113761217A (en) Artificial intelligence-based question set data processing method and device and computer equipment
WO2023137918A1 (en) Text data analysis method and apparatus, model training method, and computer device
CN116244473A (en) Multi-mode emotion recognition method based on feature decoupling and graph knowledge distillation
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN115116444A (en) Processing method, device and equipment for speech recognition text and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant