CN112686021A - Text feature extraction method, text feature extraction device, and storage medium - Google Patents

Text feature extraction method, text feature extraction device, and storage medium

Info

Publication number
CN112686021A
Authority
CN
China
Prior art keywords
text
feature extraction
predicted
target task
training
Prior art date
Legal status
Pending
Application number
CN202110001286.5A
Other languages
Chinese (zh)
Inventor
Chen Ming (陈明)
Current Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202110001286.5A priority Critical patent/CN112686021A/en
Publication of CN112686021A publication Critical patent/CN112686021A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a text feature extraction method, a text feature extraction device, and a storage medium. The text feature extraction method comprises the following steps: acquiring a target task to be predicted; determining, according to the field corresponding to the target task to be predicted, a text set matched with the target task to be predicted, wherein the text set comprises a plurality of texts; and extracting text features of the texts in the text set through a pre-trained language model. With this text feature extraction method, a target task to be predicted based on natural language processing can be matched with a suitable text set, and the text features can then be extracted in a targeted manner, which reduces the cost of deploying the pre-trained language model, lowers the training difficulty, and accelerates the progress of the natural language processing task.

Description

Text feature extraction method, text feature extraction device, and storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text feature extraction method, a text feature extraction device, and a storage medium.
Background
When a neural network model is trained to obtain a language model, initialization parameters need to be determined in advance. If random initial parameters are used, a sufficiently large training set is required, and the parameters in the neural network model converge slowly, resulting in a long training process. The concept of pre-training arose on this basis. Pre-training trains the parameters of the neural network before the neural network model is formally trained, so as to obtain suitable initialization parameters; setting these suitable initialization parameters for the training of the language model accelerates the optimization and convergence of the language model during training.
In the related art, when a pre-trained language model is used for text feature extraction, the extraction is usually performed based on the texts stored in a database. Because the data in the database are numerous and of varied types, the cost of deploying the pre-trained language model is too high, and training the language model is difficult.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a text feature extraction method, a text feature extraction apparatus, and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a text feature extraction method, including: acquiring a target task to be predicted; determining, according to the field corresponding to the target task to be predicted, a text set matched with the target task to be predicted, wherein the text set comprises a plurality of texts; and extracting text features of the texts in the text set through a pre-trained language model.
In one embodiment, the language model is pre-trained in the following manner: acquiring a plurality of training texts in a specified field; inputting the plurality of training texts into a neural network model in batches; and for each batch, extracting the text features of the training texts corresponding to the batch in a random mask manner, and adjusting the neural network model according to the extracted text features to obtain the language model.
In another embodiment, extracting the text features of the training texts corresponding to the batch in a random mask manner includes: for the training texts in each batch, performing one-hot coding extraction on the training texts corresponding to the current batch in a random mask manner to obtain the text features of the training texts corresponding to the current batch.
In another embodiment, after the text set matched with the target task to be predicted is determined according to the field corresponding to the target task to be predicted, the text feature extraction method further includes: if it is determined that newly added texts exist in one or more fields corresponding to the target task to be predicted, adding the newly added texts to the text set matched with the target task to be predicted.
In yet another embodiment, the fields include: a general text field or an e-commerce field.
According to a second aspect of the embodiments of the present disclosure, there is provided a text feature extraction device, including: an acquisition unit configured to acquire a target task to be predicted; a determining unit configured to determine, according to the field corresponding to the target task to be predicted, a text set matched with the target task to be predicted, wherein the text set comprises a plurality of texts; and an extraction unit configured to extract text features of the texts in the text set through a pre-trained language model.
In one embodiment, the language model is pre-trained in the following manner: acquiring a plurality of training texts in a specified field; inputting the plurality of training texts into a neural network model in batches; and for each batch, extracting the text features of the training texts corresponding to the batch in a random mask manner, and adjusting the neural network model according to the extracted text features to obtain the language model.
In another embodiment, the language model extracts the text features of the training texts corresponding to the batch in a random mask manner as follows: for the training texts in each batch, performing one-hot coding extraction on the training texts corresponding to the current batch in a random mask manner to obtain the text features of the training texts corresponding to the current batch.
In still another embodiment, the text feature extraction device further includes: an updating unit configured to add newly added texts to the text set matched with the target task to be predicted if it is determined that the newly added texts exist in one or more fields corresponding to the target task to be predicted.
In yet another embodiment, the fields include: a general text field or an e-commerce field.
According to a third aspect of the embodiments of the present disclosure, there is provided a text feature extraction device including: a memory to store instructions; and the processor is used for calling the instructions stored in the memory to execute any one of the text feature extraction methods.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, in which instructions are stored, and when the instructions are executed by a processor, the method for extracting text features is performed.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: with this text feature extraction method, a target task to be predicted based on natural language processing can be matched with a suitable text set, and the text features can then be extracted in a targeted manner, thereby reducing the cost of deploying the pre-trained language model, lowering the training difficulty, and accelerating the progress of the natural language processing task.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of text feature extraction according to an example embodiment.
FIG. 2 is a flow diagram illustrating a method for pre-training a language model in accordance with an exemplary embodiment.
FIG. 3 is a flow diagram illustrating another method of text feature extraction according to an example embodiment.
FIG. 4 is a schematic diagram illustrating a training process according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a text feature extraction apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the related art, when a pre-trained language model is used for text feature extraction, the fields for text feature extraction are selected indiscriminately, and the text features corresponding to all of those fields are then obtained, so the training and deployment cost of the language model is high and training is difficult. In a natural language processing task, if the field from which text features are extracted matches the target task to be predicted, the target task to be predicted can proceed smoothly. If the field from which text features are extracted does not match the target task to be predicted, the target task to be predicted has to wait until a matching field is used for extraction, which takes a long time.
In view of this, the present disclosure provides a text feature extraction method, which can determine a field in which text extraction is currently required according to a target task to be predicted in a natural language processing task, and further perform targeted feature extraction, thereby reducing training difficulty, promoting a natural language processing process, and reducing extraction cost.
Fig. 1 is a flowchart illustrating a text feature extraction method according to an exemplary embodiment, and as shown in fig. 1, the text feature extraction method includes the following steps S11 to S13.
In step S11, a target task to be predicted is acquired.
In the embodiment of the present disclosure, the text set used for feature extraction is determined based on the field corresponding to the target task to be predicted of the current natural language processing task. The target task to be predicted may include: sentiment analysis in the general domain, multi-label text classification in the e-commerce domain, or product name extraction (named entity recognition, NER) in the e-commerce domain. Therefore, in order to prevent data redundancy caused by extracting too many invalid text features, the target task to be predicted is obtained before text feature extraction is carried out, so that the extraction can be performed in a targeted manner.
In step S12, a text set matching the target task to be predicted is determined according to the domain corresponding to the target task to be predicted.
In the embodiment of the present disclosure, in order to make the extracted text features closer to their corresponding fields, when one or more text sets corresponding to the one or more fields of the target task to be predicted are matched, a plurality of texts are obtained in each field, so that the text set corresponding to each field has enough texts for feature extraction. The texts in the text set may come from a local database, the cloud, or the internet, which is not limited in the present disclosure.
In one example, the domains include: a general text field or an e-commerce field. The general text field may include: news, social media, Wikipedia, Baidu Encyclopedia, Amazon reviews, movie reviews, and the like.
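For illustration only, the following Python sketch shows one possible way to match a target task to be predicted with the text set of its corresponding field, as described in steps S11 and S12. The task names, field labels, and example texts are hypothetical and are not prescribed by the present disclosure.

```python
# Illustrative sketch: matching a target task to be predicted with a text set
# by field. Task names, field labels, and corpus sources are hypothetical.

# Hypothetical mapping from target tasks to their corresponding fields.
TASK_TO_DOMAIN = {
    "general_sentiment_analysis": "general_text",
    "ecommerce_multilabel_classification": "e_commerce",
    "ecommerce_product_name_ner": "e_commerce",
}

# Hypothetical text sets per field; in practice these could come from a local
# database, the cloud, or the internet.
DOMAIN_TO_TEXT_SET = {
    "general_text": ["news article ...", "encyclopedia entry ...", "movie review ..."],
    "e_commerce": ["product review ...", "product title ...", "shop Q&A ..."],
}


def get_matching_text_set(target_task: str) -> list[str]:
    """Return the text set matched to the field of the target task to be predicted."""
    domain = TASK_TO_DOMAIN[target_task]   # step S11/S12: task -> field
    return DOMAIN_TO_TEXT_SET[domain]      # field -> matched text set


if __name__ == "__main__":
    texts = get_matching_text_set("ecommerce_product_name_ner")
    print(f"matched {len(texts)} texts for feature extraction")
```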
In step S13, the text features of the texts in the text set are extracted through the pre-trained language model.
In the embodiment of the disclosure, using the pre-trained language model for text feature extraction makes it possible to obtain a language model with high performance and good effect at a low training cost, with the model parameters of the language model converging quickly. Extracting the text features of the texts in each of the plurality of text sets through the pre-trained language model can then reduce the parameter quantity of the language model and the training cost of the language model.
In an implementation scenario, in a natural language processing task, multiple target tasks to be predicted share the same pre-trained language model. When text feature extraction is performed, the extraction or training can therefore be carried out in a targeted manner, which saves the computation cost of the pre-trained language model during text feature extraction and accelerates the text feature extraction process.
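A minimal sketch of step S13 is given below, assuming the Hugging Face transformers library and the publicly available bert-base-chinese checkpoint as a stand-in for the pre-trained language model; the disclosure does not prescribe a specific toolkit or checkpoint, and any domain-matched pre-trained model could be shared by multiple target tasks in the same way.

```python
# Illustrative feature extraction with a shared pre-trained language model.
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-chinese" is only a stand-in; any domain-matched pre-trained
# checkpoint could be substituted here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")
model.eval()


def extract_text_features(text_set: list[str]) -> torch.Tensor:
    """Extract one feature vector per text in the matched text set (step S13)."""
    inputs = tokenizer(text_set, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden states into a fixed-size feature per text.
    return outputs.last_hidden_state.mean(dim=1)


features = extract_text_features(["这款手机性价比很高", "物流速度快，包装完好"])
print(features.shape)  # (2, hidden_size)
```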
Through the above embodiment, the field from which text currently needs to be extracted is determined based on the target task to be predicted in the natural language processing task, and targeted feature extraction is performed, thereby reducing the extraction cost, lowering the training difficulty, promoting fast convergence of the model parameters of the language model, and accelerating the natural language processing process.
The pre-training process of the language model is explained below with reference to embodiments of the present disclosure.
Fig. 2 is a flowchart illustrating a method for pre-training a language model according to an exemplary embodiment, where the method for pre-training a language model, as shown in fig. 2, includes the following steps S21 to S23.
In step S21, a plurality of training texts in a specified domain are acquired.
In step S22, a plurality of training texts are input into the neural network model in batches.
In the embodiment of the disclosure, when text feature extraction is performed, inputting the plurality of training texts into the neural network in batches helps accelerate training and makes it possible to discover abnormal parameter convergence in time during training, thereby avoiding inaccurate training results or unsatisfactory parameter fitting. In addition, to make the extracted text features closer to how people express themselves in the specified field, when text features are extracted for each batch of training texts, a random mask is used to cover characters in the training texts and add noise, which improves the robustness of the neural network model. Furthermore, training is not limited by the number of training texts, so the language model can be pre-trained even when the number of training texts is small.
In step S23, for each batch, the text features of the training texts corresponding to the batch are extracted in a random mask manner, and the neural network model is adjusted according to the extracted text features to obtain the language model.
In the embodiment of the disclosure, during pre-training, the plurality of text features extracted in each batch are compared with the training texts in the specified field, and the model parameters of the neural network are adjusted according to the comparison result, which accelerates the convergence of the model parameters and improves their reliability, so that the obtained pre-trained language model can provide a better initial state for the formal training of the language model and achieve an ideal training effect.
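The per-batch pre-training described in steps S22 and S23 could look roughly like the following sketch, assuming the Hugging Face transformers library, a 15% random-mask ratio, and the AdamW optimizer, none of which are fixed by the present disclosure; the masked-language-modeling loss plays the role of comparing the predictions with the original training texts, and the backward pass adjusts the neural network model.

```python
# Minimal pre-training loop sketch (library, mask ratio, and optimizer are assumptions).
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")    # stand-in checkpoint
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Random masking applied per batch; masked positions are re-drawn for every batch.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

training_texts = ["训练文本一 ...", "训练文本二 ...", "训练文本三 ..."]  # specified-field texts
encodings = [tokenizer(t, truncation=True) for t in training_texts]
loader = DataLoader(encodings, batch_size=2, shuffle=True, collate_fn=collator)

model.train()
for batch in loader:              # step S22: batched input
    outputs = model(**batch)      # step S23: predict the randomly masked characters
    loss = outputs.loss           # compare predictions with the original training text
    loss.backward()               # adjust the neural network model
    optimizer.step()
    optimizer.zero_grad()
```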
In one embodiment, when text feature extraction is performed on each batch of training texts in a random mask manner, one-hot coding (One-Hot) is used for masking, and the text features of the training texts corresponding to the current batch are then obtained from the randomly masked texts. When the training texts corresponding to the current batch are extracted using one-hot coding, the token vector, the corresponding segment vector, and the corresponding position vector of each randomly selected character are covered. When the pre-trained language model is debugged, the robustness of the language model can then be enhanced on the basis of the randomly covered token vectors, segment vectors, and position vectors, which helps optimize the performance of the language model.
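The following NumPy sketch illustrates, under purely illustrative assumptions about the vocabulary, mask ratio, and covering strategy, how a randomly selected character and its token, segment, and position vectors might be covered, with the token represented by one-hot coding; it is one reading of this embodiment rather than a definitive implementation.

```python
# Sketch of random masking with one-hot coding; all values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(seed=0)

vocab = {"[MASK]": 0, "这": 1, "手": 2, "机": 3, "很": 4, "好": 5}
sentence = ["这", "手", "机", "很", "好"]
mask_ratio = 0.15  # illustrative random-mask ratio

token_ids = np.array([vocab[ch] for ch in sentence])
segment_ids = np.zeros(len(sentence), dtype=int)   # segment vector (single sentence)
position_ids = np.arange(len(sentence))            # position vector

# Randomly select characters to cover.
mask = rng.random(len(sentence)) < mask_ratio
token_ids[mask] = vocab["[MASK]"]

# One possible reading of also covering the segment and position vectors of the
# randomly masked characters (an assumption, not prescribed verbatim by the text).
segment_ids[mask] = 0
position_ids[mask] = 0

# One-hot coding of the (masked) token ids.
one_hot = np.eye(len(vocab), dtype=int)[token_ids]

print(one_hot.shape)   # (sequence length, vocabulary size)
print(mask, token_ids)
```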
The extracted text features should also stay up to date and reflect how people currently express themselves. To this end, the present disclosure further provides another text feature extraction method.
Fig. 3 is a flowchart illustrating another text feature extraction method according to an exemplary embodiment, and as shown in fig. 3, the text feature extraction method includes the following steps S31 to S34.
In step S31, a target task to be predicted is acquired.
In step S32, a text set matching the target task to be predicted is determined according to the domain corresponding to the target task to be predicted.
In step S33, if it is determined that the newly added text exists in one or more fields corresponding to the target task to be predicted, the newly added text is added to the text set matching the target task to be predicted.
In the embodiment of the disclosure, as society develops, people in every field continuously create new words or use homophones to describe things in a specified field in different ways. Therefore, to make the extracted text features better fit the current development of the field, when the text set is obtained, newly added texts found in one or more fields corresponding to the target task to be predicted are added to the text sets matched with the target task to be predicted, so that the texts in the text sets are updated in real time. For example, in the e-commerce field there used to be relatively little related text, so only a small amount of e-commerce text could be obtained. With the development of the technology in recent years, people express more and more about this field. Accordingly, during pre-training, texts corresponding to e-commerce reviews may be added to the text set matched with the e-commerce field. In this way, the training corpus is expanded while the extracted text features better fit the development of the field in the current social environment.
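Step S33 can be sketched as below; the helper name update_text_set and the source of the newly added texts (database poll, crawl, message queue, and so on) are assumptions introduced for illustration only.

```python
# Sketch of step S33: appending newly added field texts to the matched text set.
def update_text_set(text_set: list[str], fields: list[str],
                    newly_added_by_field: dict[str, list[str]]) -> list[str]:
    """Add newly added texts from the task's fields into the matched text set."""
    for field in fields:
        new_texts = newly_added_by_field.get(field, [])
        if new_texts:                      # newly added text exists in this field
            text_set.extend(new_texts)     # keep the text set up to date
    return text_set


ecommerce_text_set = ["老款商品评论 ..."]
newly_added = {"e_commerce": ["直播间秒杀真划算", "这个盲盒隐藏款太难抽了"]}
ecommerce_text_set = update_text_set(ecommerce_text_set, ["e_commerce"], newly_added)
print(len(ecommerce_text_set))  # 3
```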
In step S34, the text features of the texts in the text set are extracted through the pre-trained language model.
In an implementation scenario, when the language model is pre-trained, the neural network framework used may be based on the ALBERT framework combined with some of the innovations of the RoBERTa framework to obtain a new neural network framework, which is then used for training. The flow of the new neural network framework during training may be as shown in the neural network model framework flow diagram of FIG. 4. The ALBERT framework features factorized embedding parameterization, cross-layer parameter sharing, an inter-sentence coherence loss, removal of dropout, and the like. The RoBERTa framework features more random masks (MASK) and more training data, longer training time, and the like. Combining the innovations of the two can, during pre-training, improve the accuracy of the natural language processing task, shorten the time required for training, reduce the parameter quantity of the neural network model, and reduce the cost of model deployment, so as to obtain an initialized language model with ideal effect and better performance.
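As a configuration sketch only, the snippet below shows how ALBERT-style factorized embedding parameterization, cross-layer parameter sharing, and removal of dropout could be combined with RoBERTa-style dynamic random masking using the Hugging Face transformers library; the library choice, the bert-base-chinese tokenizer stand-in, and all hyperparameter values are assumptions, and the inter-sentence coherence loss would require the ALBERT pre-training head, which is omitted here.

```python
# Illustrative configuration combining ALBERT-style structure with RoBERTa-style
# dynamic masking; library choice and all values are assumptions.
from transformers import (AlbertConfig, AlbertForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

config = AlbertConfig(
    vocab_size=21128,                  # illustrative Chinese character vocabulary size
    embedding_size=128,                # factorized embedding parameterization (E << H)
    hidden_size=768,
    num_hidden_layers=12,
    num_hidden_groups=1,               # cross-layer parameter sharing: one shared group
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_dropout_prob=0.0,           # dropout removed, as in ALBERT
    attention_probs_dropout_prob=0.0,
)
model = AlbertForMaskedLM(config)

# RoBERTa-style dynamic random masking: the collator re-draws the masked positions
# every time a batch is built, so longer training sees more distinct masked variants.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # stand-in tokenizer
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
```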
In another implementation scenario, the language model is constructed from a neural network model based on the ALBERT framework. When the language model is trained, the training texts are input in batches into the ALBERT neural network model. When the text features of each batch of training texts are extracted using one-hot coding (One-Hot), the input training texts are randomly masked (MASK), and the token vector, the corresponding segment vector, and the corresponding position vector of each randomly masked character are extracted; the token vector, segment vector, and position vector of the character are covered using one-hot coding. Text feature extraction is then performed on each batch of randomly masked training texts through processing such as factorized embedding parameterization, cross-layer parameter sharing, the inter-sentence coherence loss, and removal of dropout in the ALBERT neural network model, so that the text features of each batch of training texts are finally extracted. Because the features are extracted in a random mask manner, the more training data and the longer the training time, the stronger the robustness of the resulting language model. A language model trained in this way has a smaller parameter quantity and a lower deployment cost when deployed in the early stage of training, and even when the amount of training data is small, the language model obtained in this way still has good performance and strong robustness.
In yet another implementation scenario, the training process for extracting text features may be as shown in FIG. 4, which is a schematic diagram of the training process. The training texts may include past texts in the general field, texts in the e-commerce field, and newly added texts from recent years. The plurality of training texts are input in batches into the ALBERT neural network model based on the ALBERT framework. For the training texts input in each batch, the randomly masked characters and the token vectors, segment vectors, and position vectors corresponding to those characters are extracted in a dynamic random mask manner, and the token vector, segment vector, and position vector corresponding to each randomly masked character are covered using one-hot coding (One-Hot). Text feature extraction is performed on the texts whose token vectors, segment vectors, and position vectors have been masked through processing such as factorized embedding parameterization, cross-layer parameter sharing, the inter-sentence coherence loss, and removal of dropout in the ALBERT neural network model. The text features of each batch of texts output by the ALBERT neural network model are thus obtained, completing the text feature extraction for each batch. The ALBERT neural network model is adjusted according to the text features extracted from each batch of texts, completing the training of the ALBERT neural network model and obtaining the pre-trained language model.
Based on the same conception, the embodiment of the disclosure also provides a text feature extraction device.
It is understood that, in order to implement the above functions, the text feature extraction apparatus provided in the embodiments of the present disclosure includes a hardware structure or a software module corresponding to each function. The disclosed embodiments can be implemented in hardware or a combination of hardware and computer software, in combination with the exemplary elements and algorithm steps disclosed in the disclosed embodiments. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Fig. 5 is a block diagram illustrating a text feature extraction apparatus according to an example embodiment. Referring to fig. 5, the text feature extraction apparatus 100 includes an acquisition unit 101, a determination unit 102, and an extraction unit 103.
The acquisition unit 101 is configured to acquire a target task to be predicted.
The determining unit 102 is configured to determine, according to a field corresponding to the target task to be predicted, a text set matching the target task to be predicted, where the text set includes a plurality of texts.
And the extraction unit 103 is used for extracting the text features of the texts in the text set through the pre-trained language model.
In one embodiment, the language model is pre-trained in the following manner: acquiring a plurality of training texts in a specified field; inputting the plurality of training texts into the neural network model in batches; and for each batch, extracting the text features of the training texts corresponding to the batch in a random mask manner, and adjusting the neural network model according to the extracted text features to obtain the language model.
In another embodiment, the language model extracts the text features of the training texts corresponding to the batch in a random mask manner as follows: for the training texts in each batch, performing one-hot coding extraction on the training texts corresponding to the current batch in a random mask manner to obtain the text features of the training texts corresponding to the current batch.
In still another embodiment, the text feature extraction apparatus 100 further includes: and the updating unit 104 is configured to, if it is determined that the newly added text exists in one or more fields corresponding to the target task to be predicted, add the newly added text to a text set matching the target task to be predicted.
In yet another embodiment, the domains include: a general text field or an e-commerce field.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Further, in an exemplary embodiment, the text feature extraction device may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods. For example, the text feature extraction device includes: a memory to store instructions; and a processor configured to invoke the instructions stored in the memory to execute the text feature extraction method provided by any one of the above embodiments.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided that includes instructions, such as a memory, that are executable by a processor of a text feature extraction apparatus to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It is further understood that the use of "a plurality" in this disclosure means two or more, as other terms are analogous. "or", which describes the association relationship of the associated objects, means that there may be three relationships, e.g., a or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be further understood that the terms "first," "second," and the like are used to describe various information and that such information should not be limited by these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the terms "first," "second," and the like are fully interchangeable. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
It will be further understood that, unless otherwise specified, "connected" includes direct connections between the two without the presence of other elements, as well as indirect connections between the two with the presence of other elements.
It is further to be understood that while operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A text feature extraction method is characterized by comprising the following steps:
acquiring a target task to be predicted;
determining a text set matched with the target task to be predicted according to the field corresponding to the target task to be predicted, wherein the text set comprises a plurality of texts;
and extracting text features of the text in the text set through a pre-trained language model.
2. The method of extracting text features of claim 1, wherein the language model is pre-trained in the following way:
acquiring a plurality of training texts in a specified field;
inputting a plurality of training texts into a neural network model in batches;
and for each batch, extracting the text features of the training texts corresponding to the batch in a random mask mode, and adjusting the neural network model according to the extracted text features to obtain the language model.
3. The method of extracting text features according to claim 2, wherein the extracting the text features of the training texts corresponding to the batch by using a random mask includes:
and for the training texts in each batch, performing one-hot coding extraction on the training texts corresponding to the current batch in a random mask mode to obtain text features of the training texts corresponding to the current batch.
4. The method for extracting text features according to claim 1, wherein after determining the text set matching the target task to be predicted according to the field corresponding to the target task to be predicted, the method for extracting text features further comprises:
and if it is determined that newly added texts exist in one or more fields corresponding to the target task to be predicted, adding the newly added texts into a text set matched with the target task to be predicted.
5. The text feature extraction method according to claim 1, wherein the domain includes: a general text field or an e-commerce field.
6. A text feature extraction device, characterized by comprising:
an acquisition unit, configured to acquire a target task to be predicted;
a determining unit, configured to determine, according to the field corresponding to the target task to be predicted, a text set matched with the target task to be predicted, wherein the text set comprises a plurality of texts; and
an extraction unit, configured to extract the text features of the texts in each text set through a pre-trained language model.
7. The text feature extraction device according to claim 6, wherein the language model is pre-trained in the following manner:
acquiring a plurality of training texts in a specified field;
inputting a plurality of training texts into a neural network model in batches;
and for each batch, extracting the text features of the training texts corresponding to the batch in a random mask mode, and adjusting the neural network model according to the extracted text features to obtain the language model.
8. The apparatus according to claim 7, wherein the language model extracts the text features of the training texts corresponding to the batch by using a random mask in the following manner, including:
and for the training texts in each batch, performing one-hot coding extraction on the training texts corresponding to the current batch in a random mask mode to obtain text features of the training texts corresponding to the current batch.
9. The text feature extraction device according to claim 6, further comprising:
an updating unit, configured to add the newly added text into a text set matched with the target task to be predicted if it is determined that newly added text exists in one or more fields corresponding to the target task to be predicted.
10. The text feature extraction device according to claim 6, wherein the domain includes: a general text field or an e-commerce field.
11. A text feature extraction device, characterized by comprising:
a memory to store instructions; and
a processor for invoking the memory-stored instructions to perform the text feature extraction method of any one of claims 1-5.
12. A computer-readable storage medium having stored therein instructions which, when executed by a processor, perform a text feature extraction method as claimed in any one of claims 1-5.
CN202110001286.5A 2021-01-04 2021-01-04 Text feature extraction method, text feature extraction device, and storage medium Pending CN112686021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110001286.5A CN112686021A (en) 2021-01-04 2021-01-04 Text feature extraction method, text feature extraction device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110001286.5A CN112686021A (en) 2021-01-04 2021-01-04 Text feature extraction method, text feature extraction device, and storage medium

Publications (1)

Publication Number Publication Date
CN112686021A true CN112686021A (en) 2021-04-20

Family

ID=75456871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110001286.5A Pending CN112686021A (en) 2021-01-04 2021-01-04 Text feature extraction method, text feature extraction device, and storage medium

Country Status (1)

Country Link
CN (1) CN112686021A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611769A (en) * 2019-02-25 2020-09-01 北京嘀嘀无限科技发展有限公司 Text conversion method and device for multiple language models
US20200334334A1 (en) * 2019-04-18 2020-10-22 Salesforce.Com, Inc. Systems and methods for unifying question answering and text classification via span extraction
CN110347838A (en) * 2019-07-17 2019-10-18 成都医云科技有限公司 Model training method and device are examined by Xian Shang department point
CN111833849A (en) * 2020-03-10 2020-10-27 北京嘀嘀无限科技发展有限公司 Method for speech recognition and speech model training, storage medium and electronic device
CN111680145A (en) * 2020-06-10 2020-09-18 北京百度网讯科技有限公司 Knowledge representation learning method, device, equipment and storage medium
CN111831805A (en) * 2020-07-01 2020-10-27 中国建设银行股份有限公司 Model creation method and device, electronic equipment and readable storage device
CN111950265A (en) * 2020-08-25 2020-11-17 中国电子科技集团公司信息科学研究院 Domain lexicon construction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU Yeheng; SHI Jiahan; XU Ruifeng: "Text Matching Method Combining Pre-trained Models and Language Knowledge Bases" (结合预训练模型和语言知识库的文本匹配方法), Journal of Chinese Information Processing (中文信息学报), no. 02, 15 February 2020 (2020-02-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113556404A (en) * 2021-08-03 2021-10-26 广东九博科技股份有限公司 Communication method and system between single disks in equipment
CN114677565A (en) * 2022-04-08 2022-06-28 北京百度网讯科技有限公司 Training method of feature extraction network and image processing method and device

Similar Documents

Publication Publication Date Title
KR102400017B1 (en) Method and device for identifying an object
US11640518B2 (en) Method and apparatus for training a neural network using modality signals of different domains
CN109829039B (en) Intelligent chat method, intelligent chat device, computer equipment and storage medium
CN110717106B (en) Information pushing method and device
CN112464809B (en) Face key point detection method and device, electronic equipment and storage medium
CN113656582B (en) Training method of neural network model, image retrieval method, device and medium
CN112396106B (en) Content recognition method, content recognition model training method, and storage medium
US20200101383A1 (en) Method and apparatus for recognizing game command
US20230017112A1 (en) Image generation method and apparatus
CN112686021A (en) Text feature extraction method, text feature extraction device, and storage medium
KR20190056940A (en) Method and device for learning multimodal data
CN110502976A (en) The training method and Related product of text identification model
KR20200084260A (en) Electronic apparatus and controlling method thereof
CN113836303A (en) Text type identification method and device, computer equipment and medium
CN114610851A (en) Method for training intention recognition model, intention recognition method, apparatus and medium
CN114880472A (en) Data processing method, device and equipment
CN112434524A (en) Text information processing method and device, electronic equipment and storage medium
CN117122927A (en) NPC interaction method, device and storage medium
CN112100509B (en) Information recommendation method, device, server and storage medium
CN110347807B (en) Problem information processing method and device
CN113408564A (en) Graph processing method, network training method, device, equipment and storage medium
CN114969195B (en) Dialogue content mining method and dialogue content evaluation model generation method
CN110879832A (en) Target text detection method, model training method, device and equipment
CN114913860A (en) Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product
CN114495903A (en) Language category identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination