CN112232070A - Natural language processing model construction method, system, electronic device and storage medium - Google Patents
Info
- Publication number
- CN112232070A (application CN202011124616.1A)
- Authority
- CN
- China
- Prior art keywords
- word vector
- natural language
- language processing
- loss function
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention provides a natural language processing model construction method, system, electronic device and storage medium. The technical scheme of the method includes extracting information with a joint extraction method and mining information at different granularities, including character vectors, word vectors, and the word vectors corresponding to part-of-speech information. In addition, the method performs negative sampling on the original training data to obtain a batch of negative samples, which alleviates the low-resource problem of the model and increases the recognition difficulty during training. The invention improves the effect of extracting unstructured text information and improves the robustness of the model.
Description
Technical Field
The invention belongs to the field of data processing, and particularly relates to a natural language processing model construction method and system, electronic equipment and a storage medium.
Background
A large amount of unstructured text exists in the field of language processing, particularly news text. These texts contain a large number of entities, and different relationships exist among different entities. Effectively extracting information from unstructured text can assist automatic text understanding and the construction of knowledge graphs.
The prior art mainly comprises template-based methods, pipeline-based information extraction methods and semi-supervised information extraction methods. However, the encoders of these methods are not strong enough and the encoded feature dimensions are not rich enough; entity recognition and relation classification cannot be trained simultaneously, so joint training on the original data is impossible and the problem of error accumulation between models is not solved; and the data cannot be effectively augmented, so the model effect suffers greatly when data is scarce.
Disclosure of Invention
The embodiment of the application provides a natural language processing model construction method, system, electronic device and storage medium, and aims at least to solve the problem of poor extraction performance of current methods on unstructured text information.
In a first aspect, an embodiment of the present application provides a method for constructing a natural language processing model, including:
S101, marking original training data in a text sample;
S102, carrying out negative sampling on the original training data to obtain negative example data;
S103, combining the original training data and the negative example data into final training data;
S104, obtaining part-of-speech information in the text sample by using a natural language processing tool, and training a first word vector according to the final training data;
S105, converting words in the text sample into a word vector and a second word vector, and combining the word vector and the second word vector;
S106, obtaining an entity classification loss function of the text sample according to the first word vector, the word vector and the second word vector;
S107, obtaining a relation classification loss function of the text sample according to the relation information of the first word vector, the second word vector and the text sample;
S108, adding the entity classification loss function and the relation classification loss function to obtain a joint loss function, and performing a back-propagation gradient operation on the joint loss function to obtain a natural language processing model.
Preferably, the original training data includes text, entities in the text, corresponding lengths of the text, tag sets of the entities, positions of the text pairs, and tag sets of the text pairs.
Preferably, the step S102 includes designating the number of entities and the number of relationships of the negative example, acquiring different entities, and forming corresponding relationships according to the different entities.
Preferably, the step S104 includes using Word2Vec when converting the first Word vector; the first word vector is a word vector corresponding to the part of speech information.
Preferably, the step S105 includes using RoBERTa when transforming the Word vector and using Word2Vec when transforming the second Word vector.
Preferably, the step S106 includes inputting the CLS vector information, the vector information of the entity length, and the max-pooled entity vector information into the entity classification model, splicing them, and performing classification with a softmax function.
Preferably, the step S107 includes: inputting the vector information of the head entity, the vector information of the tail entity, and the max-pooled vector information of the context between the two entities into a relation classification model, splicing them, and performing classification with a softmax function.
In a second aspect, an embodiment of the present application provides a natural language processing model building system, which is suitable for the above natural language processing model building method, and includes:
a preprocessing unit: marking original training data in a text sample, carrying out negative sampling on the original training data to obtain negative example data, and combining the original training data and the negative example data into final training data;
a vector conversion unit: using a natural language processing tool to obtain part-of-speech information in the text sample, training a first word vector according to the final training data, converting words in the text sample into a word vector and a second word vector, and combining the word vector and the second word vector;
an entity classification loss function acquisition unit: obtaining an entity classification loss function of the text sample according to the first word vector, the word vector and the second word vector;
a relationship classification loss function acquisition unit: obtaining a relation classification loss function of the text sample according to the relation information of the first word vector, the second word vector and the text sample;
a model construction unit: and adding the entity classification loss function and the relation classification loss function to obtain a joint loss function, and performing back propagation gradient operation on the joint loss function to obtain a natural language processing model.
In some of these embodiments, the raw training data includes text, entities in the text, lengths to which the text corresponds, a set of labels for the entities, locations for pairs of the text, and a set of labels for pairs of the text.
In some embodiments, the preprocessing unit specifies the number of entities and the number of relations for the negative examples, acquires different entities, and forms corresponding relations according to the different entities.
In some of these embodiments, the vector conversion unit includes using Word2Vec in converting the first Word vector; the first word vector is a word vector corresponding to the part of speech information.
In some of these embodiments, the vector conversion unit includes using RoBERTa when converting the Word vector and using Word2Vec when converting the second Word vector.
In some embodiments, the entity classification loss function obtaining unit inputs the CLS vector information, the vector information of the entity length, and the max-pooled entity vector information into the entity classification model, splices them, and performs classification with a softmax function.
In some embodiments, the relationship classification loss function obtaining unit inputs the vector information of the head entity, the vector information of the tail entity, and the max-pooled vector information of the context between the two entities into a relation classification model, splices them, and performs classification with a softmax function.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor, when executing the computer program, implements a natural language processing model building method as described in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a natural language processing model building method as described in the first aspect above.
Compared with the related art, the natural language processing model construction method provided by the embodiment of the application has the following advantages:
1. The information is extracted with a joint extraction method, so entity information can be better utilized, the error accumulation caused by the pipeline approach is reduced, and a single model is finally obtained, which is convenient to deploy.
2. Encoding information at different granularities, including POS-level information, RoBERTa subword-level information, word2vec word-level information, and the CLS information in RoBERTa, encodes the text information more fully.
3. Negative sampling on the original samples yields a batch of negative samples, which alleviates the low-resource problem of the model and increases the recognition difficulty, thereby improving the robustness of the model.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow diagram of a method for constructing a natural language processing model according to an embodiment of the present application;
FIG. 2 is a framework diagram of a natural language processing model building system according to an embodiment of the present application;
FIG. 3 is a block diagram of an electronic device according to an embodiment of the present application;
in the above figures:
11. a pre-processing unit; 12. a vector conversion unit; 13. an entity classification loss function acquisition unit; 14. a relation classification loss function obtaining unit; 15. a model construction unit; 20. a bus; 21. a processor; 22. a memory; 23. a communication interface.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
A large amount of unstructured text exists in the language processing field. These texts contain a large number of entities, and different relationships exist among different entities; effectively extracting information from unstructured text can assist automatic text understanding and the construction of knowledge graphs. The embodiment of the invention provides a natural language processing model construction method and system, an electronic device and a storage medium applicable to information extraction from unstructured text, for example unstructured text in the news field.
Some of the terms of art to which the invention relates are described below:
information Extraction (IE) refers to extracting corresponding entities and relationships between entities from a piece of text, and specific techniques include named entity identification (NER) and relationship classification (RE). Named Entity Recognition (NER) is a very fundamental task in the fields of NLP, knowledge-graph, etc., aimed at locating and classifying named entities in text into predefined categories such as people, organizations, locations, temporal expressions, quantities, monetary values, percentages, etc. The effect of named entity recognition directly determines the effect of downstream tasks. The relation classification (RE) is a form of text classification, and performs a classification operation on the extracted entity pairs and the text information to obtain a relation of the entity pairs.
Word2vec is a group of related models used to generate word vectors. These models are shallow two-layer neural networks trained to reconstruct linguistic contexts of words: the network takes a word as input and predicts the words at adjacent positions, and under the bag-of-words assumption in word2vec, the order of those words is unimportant. After training, the word2vec model can map each word to a vector that can represent word-to-word relationships; the vector corresponds to the hidden layer of the neural network.
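As an illustration of this word-to-vector mapping, the sketch below stands in for a trained word2vec model with a tiny hand-made vocabulary; the 4-dimensional vectors are invented purely for illustration (real training would use a library such as gensim), and the cosine-similarity comparison shows how related words end up closer in vector space.

```python
import math

# Hypothetical word vectors standing in for a trained word2vec model's output.
# These 4-dimensional vectors are invented purely for illustration.
WORD_VECTORS = {
    "king":  [0.90, 0.80, 0.10, 0.20],
    "queen": [0.85, 0.75, 0.15, 0.80],
    "apple": [0.10, 0.20, 0.90, 0.30],
}

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_royal = cosine_similarity(WORD_VECTORS["king"], WORD_VECTORS["queen"])
sim_fruit = cosine_similarity(WORD_VECTORS["king"], WORD_VECTORS["apple"])
# Related words sit closer in vector space than unrelated ones.
```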
BERT is a pre-training model released by Google. It was trained with a static masking scheme and a relatively small batch size, so its power was not fully exploited. In 2019, Facebook released a deeper and larger pre-training model, RoBERTa. RoBERTa uses a larger batch size, more data, longer training time and longer sentences, adopts dynamic masking, and removes the NSP task of BERT; as a result, RoBERTa achieves better performance and surpasses BERT on every benchmark.
Softmax is a function used in the classification process to implement multi-class classification: it maps the output neurons to real numbers in (0, 1), and the normalization guarantees that they sum to 1, so the outputs can be interpreted as class probabilities.
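The mapping can be sketched in a few lines; subtracting the maximum logit first is the standard numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities in (0, 1) that sum to 1.
    Subtracting the maximum first avoids overflow in exp() and
    does not change the result."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The largest logit receives the largest probability.
```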
Max pooling is a common pooling operation that reduces the amount of data by taking maximum values: the input is divided into several rectangular regions, and the maximum of each region is output. Besides max pooling, average pooling is also common. Pooling reduces the computation passed to subsequent layers, provides some invariance to small shifts or rotations of the target, and effectively reduces the data dimensionality.
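For text, max pooling is usually applied element-wise over a sequence of token vectors, producing one fixed-size vector regardless of sequence length. A minimal sketch:

```python
def max_pool(vectors):
    """Element-wise max pooling over a sequence of equally sized vectors:
    for each dimension, keep the maximum value across the sequence."""
    return [max(dim_values) for dim_values in zip(*vectors)]

token_vectors = [
    [0.1, 0.9, 0.3],
    [0.5, 0.2, 0.8],
    [0.4, 0.6, 0.1],
]
pooled = max_pool(token_vectors)  # a single fixed-size vector
```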
Referring to fig. 1, a flowchart of a method for constructing a natural language processing model according to an embodiment of the present application includes the following steps:
S101, marking original training data in a text sample;
S102, carrying out negative sampling on the original training data to obtain negative example data;
S103, combining the original training data and the negative example data into final training data;
S104, obtaining part-of-speech information in the text sample by using a natural language processing tool, and training a first word vector according to the final training data;
S105, converting words in the text sample into a word vector and a second word vector, and combining the word vector and the second word vector;
S106, obtaining an entity classification loss function of the text sample according to the first word vector, the word vector and the second word vector;
S107, obtaining a relation classification loss function of the text sample according to the relation information of the first word vector, the second word vector and the text sample;
S108, adding the entity classification loss function and the relation classification loss function to obtain a joint loss function, and performing a back-propagation gradient operation on the joint loss function to obtain a natural language processing model.
In order to solve the error accumulation caused by the pipeline approach in the prior art, the embodiment of the invention provides a joint method that learns the entities in a text and the relationships between entity pairs simultaneously, i.e., it considers the entity classification loss function and the relation classification loss function at the same time.
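A minimal sketch of this joint objective, with invented probability values and plain cross-entropy standing in for the real model outputs:

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class."""
    return -math.log(probs[gold_index])

def joint_loss(entity_probs, entity_gold, relation_probs, relation_gold):
    """The joint objective of step S108: the entity classification loss and
    the relation classification loss are simply added, so a single backward
    pass updates the shared encoder for both tasks."""
    return (cross_entropy(entity_probs, entity_gold)
            + cross_entropy(relation_probs, relation_gold))

# Invented model outputs: entity gold class 0, relation gold class 1.
loss = joint_loss([0.7, 0.2, 0.1], 0, [0.1, 0.8, 0.1], 1)
```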
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The original training data comprises a text, entities in the text, lengths corresponding to the text, a label set of the entities, positions of text pairs and a label set of the text pairs.
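A hypothetical annotated sample illustrating these fields; the text, spans, labels and field names are invented for illustration, since the original does not specify a concrete format.

```python
# A hypothetical annotated sample; the exact field names are an assumption.
sample = {
    "text": "Acme Corp was founded by Jane Doe.",
    "entities": [
        {"span": (0, 9),   "length": 9, "label": "ORG"},   # "Acme Corp"
        {"span": (25, 33), "length": 8, "label": "PER"},   # "Jane Doe"
    ],
    "relations": [
        # positions of the entity pair plus the relation label
        {"head": 0, "tail": 1, "label": "founded_by"},
    ],
}
```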
The step S102 includes designating the number of entities and the number of relationships of the negative examples, acquiring different entities, and forming corresponding relationships according to the different entities.
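A minimal sketch of such negative sampling under an assumed span-based annotation; the tokenization, span format and seed are invented for illustration, and negative relations could be formed analogously from pairs of sampled spans.

```python
import random

def negative_sample(tokens, gold_spans, num_negatives, max_span_len=4, seed=0):
    """Draw `num_negatives` random token spans that coincide with no gold
    entity span; they become negative entity examples. A sketch of step
    S102 under an assumed span-based annotation scheme."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < num_negatives:
        start = rng.randrange(len(tokens))
        end = min(len(tokens), start + rng.randint(1, max_span_len))
        span = (start, end)
        if span not in gold_spans and span not in negatives:
            negatives.append(span)
    return negatives

tokens = "Acme Corp was founded by Jane Doe".split()
gold = {(0, 2), (5, 7)}               # "Acme Corp", "Jane Doe"
negatives = negative_sample(tokens, gold, num_negatives=3)
```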
Wherein, the step S104 comprises using Word2Vec when converting the first Word vector; the first word vector is a word vector corresponding to the part of speech information.
RoBERTa is adopted as the encoder: RoBERTa uses a larger batch size, more data, longer training time, longer sentences, dynamic masking and other measures, so it outperforms BERT on each dataset. The length of each entity is also taken into account, and each entity is assigned a width vector. Since the BERT-family models tokenize at the byte level and consider neither word-level nor POS-level information, the embodiment of the invention proposes to fuse word-vector information and POS-level information into the feature encoding.
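A minimal sketch of this feature fusion, with toy dimensionalities standing in for the real embedding sizes; simple list concatenation stands in for tensor concatenation.

```python
def fuse_features(subword_vec, word_vec, pos_vec, width_vec):
    """Concatenate the RoBERTa subword vector with the word2vec word vector,
    the POS-tag vector and the entity-width vector, so one feature vector
    carries subword-, word- and POS-level information at once."""
    return subword_vec + word_vec + pos_vec + width_vec

# Toy dimensionalities standing in for the real embedding sizes.
fused = fuse_features([0.1] * 8, [0.2] * 4, [0.3] * 2, [0.4] * 2)
```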
Wherein, the step S105 comprises using RoBERTA when converting the Word vector and using Word2Vec when converting the second Word vector.
In some of these embodiments, the pre-training model used when converting the word vector may also be any of XLNet, ALBERT or T5.
The step S106 includes inputting the CLS vector information, the vector information of the entity length, and the max-pooled entity vector information into the entity classification model, splicing them, and performing classification with a softmax function.
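A minimal sketch of this entity classification step; the feature sizes and the linear-layer weights are invented, and plain Python lists stand in for tensors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def max_pool(vectors):
    return [max(dims) for dims in zip(*vectors)]

def classify_entity(cls_vec, width_vec, entity_token_vecs, weights):
    """Sketch of step S106: splice the CLS vector, the entity-width vector
    and the max-pooled entity token vectors, apply a linear layer (`weights`
    is an invented num_classes x num_features matrix), then softmax."""
    features = cls_vec + width_vec + max_pool(entity_token_vecs)
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)

cls_vec = [0.5, 0.1]                       # toy CLS vector
width_vec = [0.3]                          # toy entity-length embedding
entity_tokens = [[0.2, 0.9], [0.8, 0.4]]   # toy entity token vectors
weights = [[1.0, 0.0, 0.0, 0.0, 0.0],      # 2 entity classes, 5 features
           [0.0, 1.0, 0.0, 0.0, 0.0]]
entity_probs = classify_entity(cls_vec, width_vec, entity_tokens, weights)
```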
Wherein the step S107 includes: inputting the vector information of the head entity, the vector information of the tail entity, and the max-pooled vector information of the context between the two entities into a relation classification model, splicing them, and performing classification with a softmax function.
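A minimal sketch of the relation classification step under the same toy conventions; the vectors and linear-layer weights are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def max_pool(vectors):
    return [max(dims) for dims in zip(*vectors)]

def classify_relation(head_vec, tail_vec, context_token_vecs, weights):
    """Sketch of step S107: splice the head entity vector, the tail entity
    vector and the max-pooled context between the two entities, apply an
    invented linear layer, then softmax over the relation labels."""
    features = head_vec + tail_vec + max_pool(context_token_vecs)
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)

head_vec, tail_vec = [0.9, 0.1], [0.2, 0.8]   # toy entity vectors
context = [[0.3, 0.5], [0.7, 0.2]]            # toy context token vectors
weights = [[1.0] * 6, [0.5] * 6]              # 2 relation labels, 6 features
relation_probs = classify_relation(head_vec, tail_vec, context, weights)
```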
The embodiment of the application provides a natural language processing model building system, which is suitable for the above natural language processing model construction method. As used hereinafter, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware or in a combination of software and hardware is also possible and contemplated.
Fig. 2 is a framework diagram of a natural language processing model building system according to an embodiment of the present application, and includes a preprocessing unit 11, a vector transformation unit 12, an entity classification loss function obtaining unit 13, a relationship classification loss function obtaining unit 14, and a model building unit 15, where:
the preprocessing unit 11: marking original training data in a text sample, carrying out negative sampling on the original training data to obtain negative example data, and combining the original training data and the negative example data into final training data;
the vector conversion unit 12: using a natural language processing tool to obtain part-of-speech information in the text sample, training a first word vector according to the final training data, converting words in the text sample into a word vector and a second word vector, and combining the word vector and the second word vector;
entity classification loss function acquisition unit 13: obtaining an entity classification loss function of the text sample according to the first word vector, the word vector and the second word vector;
the relationship classification loss function acquisition unit 14: obtaining a relation classification loss function of the text sample according to the relation information of the first word vector, the second word vector and the text sample;
the model construction unit 15: and adding the entity classification loss function and the relation classification loss function to obtain a joint loss function, and performing back propagation gradient operation on the joint loss function to obtain a natural language processing model.
In some of these embodiments, the raw training data includes text, entities in the text, lengths to which the text corresponds, a set of labels for the entities, locations for pairs of the text, and a set of labels for pairs of the text.
In some embodiments, the preprocessing unit 11 specifies the number of entities and the number of relations for the negative examples, obtains different entities, and forms corresponding relations according to the different entities.
In some of these embodiments, the vector conversion unit 12 includes using Word2Vec in converting the first Word vector; the first word vector is a word vector corresponding to the part of speech information.
In some of these embodiments, the vector conversion unit 12 includes RoBERTa for converting the Word vector and Word2Vec for converting the second Word vector.
In some embodiments, the entity classification loss function obtaining unit 13 inputs the CLS vector information, the vector information of the entity length, and the max-pooled entity vector information into the entity classification model, splices them, and performs classification with a softmax function.
In some embodiments, the relationship classification loss function obtaining unit 14 inputs the vector information of the head entity, the vector information of the tail entity, and the max-pooled vector information of the context between the two entities into a relation classification model, splices them, and performs classification with a softmax function.
The above units may be functional units or program units, and may be implemented by software or hardware. For units implemented by hardware, the units may be located in the same processor; or the units may be located in different processors in any combination.
In addition, the method for constructing the natural language processing model according to the embodiment of the present application described in conjunction with fig. 1 may be implemented by an electronic device. Fig. 3 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
The computer device may comprise a processor 21 and a memory 22 in which computer program instructions are stored.
Specifically, the processor 21 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 22 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 21.
The processor 21 realizes any one of the natural language processing model construction methods in the above embodiments by reading and executing computer program instructions stored in the memory 22.
In some of these embodiments, the computer device may also include a communication interface 23 and a bus 20. As shown in fig. 3, the processor 21, the memory 22, and the communication interface 23 are connected via the bus 20 to complete mutual communication.
The communication interface 23 may implement data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
The bus 20 includes hardware, software, or both, coupling the components of the electronic device to one another. Bus 20 includes, but is not limited to, at least one of the following: a Data Bus, an Address Bus, a Control Bus, an Expansion Bus, or a Local Bus. By way of example, and not limitation, Bus 20 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) Bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) Bus, an InfiniBand interconnect, a Low Pin Count (LPC) Bus, a memory bus, a Micro Channel Architecture (MCA) Bus, a Peripheral Component Interconnect (PCI) Bus, a PCI-Extended (PCI-X) Bus, a Serial Advanced Technology Attachment (SATA) Bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 20 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The computer device may execute the natural language processing model construction method in the embodiments of the present application.
In addition, in combination with the natural language processing model construction method in the foregoing embodiments, an embodiment of the present application may be implemented as a computer-readable storage medium having computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the natural language processing model construction methods of the embodiments described above.
The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above-mentioned embodiments express only several implementations of the present application; their description is specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A natural language processing model construction method is characterized by comprising the following steps:
s101, marking original training data in a text sample;
s102, carrying out negative sampling on the original training data to obtain negative example data;
s103, combining the original training data and the negative example data into final training data;
s104, obtaining part-of-speech information in the text sample by using a natural language processing tool, and training a first word vector according to the final training data;
s105, converting words in the text sample into a word vector and a second word vector, and combining the word vector and the second word vector;
s106, obtaining an entity classification loss function of the text sample according to the first word vector, the word vector and the second word vector;
s107, obtaining a relation classification loss function of the text sample according to the relation information of the first word vector, the second word vector and the text sample;
and S108, adding the entity classification loss function and the relation classification loss function to obtain a combined loss function, and performing back propagation gradient operation on the combined loss function to obtain a natural language processing model.
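Step S108's joint loss can be illustrated with a minimal numpy sketch: two softmax cross-entropy losses are summed into a single scalar, which a training framework would then backpropagate. All logits, labels, and dimensions below are invented for illustration; the patent does not specify them.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # mean negative log-likelihood of the gold labels
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

# Illustrative outputs of the entity and relation classifiers (assumed shapes)
entity_logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
entity_labels = np.array([0, 1])
relation_logits = np.array([[1.2, -0.3], [0.4, 0.9]])
relation_labels = np.array([0, 1])

# Step S108: add the two classification losses into one joint loss;
# a framework such as PyTorch would then backpropagate through it.
joint_loss = (cross_entropy(entity_logits, entity_labels)
              + cross_entropy(relation_logits, relation_labels))
```

Summing the two losses lets a single backward pass update the shared encoder from both the entity and the relation supervision signals.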
2. The method of constructing a natural language processing model of claim 1, wherein the raw training data includes the text, the entities in the text, the length corresponding to the text, a set of labels for the entities, the positions of entity pairs in the text, and a set of labels for the entity pairs.
3. The method for constructing a natural language processing model according to claim 1, wherein the step S102 comprises specifying the number of entities and the number of relationships for the negative examples, obtaining different entities, and composing corresponding relationships from the different entities.
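The negative sampling of step S102 can be sketched in plain Python. The span format, the counts, and the rule "pair up gold entities that are not related in the annotation" are assumptions for illustration, not details fixed by the patent:

```python
import random

def sample_negatives(tokens, gold_entities, gold_relations,
                     n_neg_entities, n_neg_relations, max_span_len=5, seed=0):
    """Draw token spans that are not gold entities (negative entity examples)
    and ordered entity pairs that are not gold relations (negative relation
    examples). Spans are (start, end) with end exclusive."""
    rng = random.Random(seed)
    gold_spans = {(s, e) for s, e, _ in gold_entities}
    neg_entities = []
    while len(neg_entities) < n_neg_entities:
        s = rng.randrange(len(tokens))
        e = min(len(tokens), s + rng.randint(1, max_span_len))
        if (s, e) not in gold_spans and (s, e) not in neg_entities:
            neg_entities.append((s, e))
    gold_pairs = {(h, t) for h, t, _ in gold_relations}
    # pair up distinct gold entities that are not related in the annotation
    candidates = [(h, t) for h in gold_spans for t in gold_spans
                  if h != t and (h, t) not in gold_pairs]
    rng.shuffle(candidates)
    neg_relations = candidates[:n_neg_relations]
    return neg_entities, neg_relations

tokens = "the firm hired Alice Smith to lead research in Berlin".split()
gold_entities = [(3, 5, "PER"), (9, 10, "LOC")]
gold_relations = [((3, 5), (9, 10), "located_in")]
neg_ents, neg_rels = sample_negatives(tokens, gold_entities, gold_relations,
                                      n_neg_entities=3, n_neg_relations=1)
```

Per step S103, these negatives would then be appended to the original annotated data to form the final training data.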
4. The natural language processing model building method of claim 1, wherein the step S104 includes using Word2Vec when converting into the first word vector; the first word vector is the word vector corresponding to the part-of-speech information.
5. The method of constructing a natural language processing model of claim 1, wherein the step S105 comprises using RoBERTa to convert into said word vector and using Word2Vec to convert into said second word vector.
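A minimal sketch of the vector combination in step S105, with random lookup tables standing in for the real encoders (RoBERTa would supply the character-level vectors and Word2Vec the word-level vectors). The pooling choice (mean over characters) and all dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
char_dim, word_dim = 8, 4

# Stand-ins for the real encoders: random tables keep the sketch self-contained.
char_vecs = {c: rng.normal(size=char_dim) for c in "自然语言处理"}
word_vecs = {w: rng.normal(size=word_dim) for w in ["自然", "语言", "处理"]}

def combine(word):
    # Pool (here: mean) the character vectors of the word, then
    # concatenate with the word-level vector, as in step S105.
    char_part = np.mean([char_vecs[c] for c in word], axis=0)
    return np.concatenate([char_part, word_vecs[word]])

combined = combine("语言")  # one (char_dim + word_dim)-dimensional vector
```

Combining character-level and word-level granularities in this way gives downstream classifiers both sub-word and whole-word information.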
6. The method for constructing a natural language processing model according to claim 1, wherein the step S106 comprises splicing CLS vector information, vector information of the entity length, and vector information obtained by max pooling over the entity, inputting the spliced result into the entity classification model, and performing classification processing using a softmax function.
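The entity classification of step S106 can be sketched as follows; the hidden size, the number of entity types, and the linear-plus-softmax classifier head are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, n_types, max_len = 8, 4, 10

# Illustrative inputs: a CLS vector, a lookup table of entity-length
# embeddings, and encoder outputs for the entity's tokens (shapes assumed).
cls_vec = rng.normal(size=hidden)
length_emb = rng.normal(size=(max_len, hidden))
entity_tokens = rng.normal(size=(3, hidden))  # a 3-token entity span

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_entity(cls_vec, span, length_emb, W, b):
    # Step S106: splice CLS info, length info, and the max-pooled span,
    # then classify with a softmax over entity types.
    pooled = span.max(axis=0)  # max pooling over the entity's tokens
    feats = np.concatenate([cls_vec, length_emb[len(span)], pooled])
    return softmax(W @ feats + b)

W = rng.normal(size=(n_types, 3 * hidden))
b = np.zeros(n_types)
probs = classify_entity(cls_vec, entity_tokens, length_emb, W, b)
```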
7. The natural language processing model building method of claim 1, wherein the step S107 comprises: inputting the vector information of the head entity, the vector information of the tail entity, and the vector information obtained by max pooling over the context between the two entities into the relation classification model, splicing them, and performing classification processing using a softmax function.
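Similarly, a hedged sketch of the relation classification in step S107; the max pooling over each entity span, the zero-vector fallback for adjacent entities, and all dimensions are assumptions, not details fixed by the patent:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden, n_rels = 8, 3

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_relation(seq, head_span, tail_span, W, b):
    # Step S107: head entity vector, tail entity vector, and the max-pooled
    # context between the two entities, spliced and passed through softmax.
    head = seq[head_span[0]:head_span[1]].max(axis=0)
    tail = seq[tail_span[0]:tail_span[1]].max(axis=0)
    ctx = seq[head_span[1]:tail_span[0]]
    context = ctx.max(axis=0) if len(ctx) else np.zeros(seq.shape[1])
    feats = np.concatenate([head, tail, context])
    return softmax(W @ feats + b)

seq = rng.normal(size=(12, hidden))  # encoder output for a 12-token sentence
W = rng.normal(size=(n_rels, 3 * hidden))
b = np.zeros(n_rels)
probs = classify_relation(seq, (1, 3), (7, 9), W, b)
```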
8. A natural language processing model building system, comprising:
a preprocessing unit: marking original training data in a text sample, carrying out negative sampling on the original training data to obtain negative example data, and combining the original training data and the negative example data into final training data;
a vector conversion unit: using a natural language processing tool to obtain part-of-speech information in the text sample, training a first word vector according to the final training data, converting words in the text sample into a word vector and a second word vector, and combining the word vector and the second word vector;
an entity classification loss function acquisition unit: obtaining an entity classification loss function of the text sample according to the first word vector, the word vector and the second word vector;
a relationship classification loss function acquisition unit: obtaining a relation classification loss function of the text sample according to the relation information of the first word vector, the second word vector and the text sample;
a model construction unit: and adding the entity classification loss function and the relation classification loss function to obtain a joint loss function, and performing back propagation gradient operation on the joint loss function to obtain a natural language processing model.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the natural language processing model building method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the natural language processing model construction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011124616.1A CN112232070A (en) | 2020-10-20 | 2020-10-20 | Natural language processing model construction method, system, electronic device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011124616.1A CN112232070A (en) | 2020-10-20 | 2020-10-20 | Natural language processing model construction method, system, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112232070A true CN112232070A (en) | 2021-01-15 |
Family
ID=74118087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011124616.1A Pending CN112232070A (en) | 2020-10-20 | 2020-10-20 | Natural language processing model construction method, system, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112232070A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111581387A (en) * | 2020-05-09 | 2020-08-25 | 电子科技大学 | Entity relation joint extraction method based on loss optimization |
Non-Patent Citations (1)
Title |
---|
YANG SHAOXIONG: "Research on a Distantly Supervised Deep Entity Relation Extraction Method Based on an Improved Semantic Hypothesis", China Master's Theses Full-text Database, Information Science and Technology, no. 12, 15 December 2018 (2018-12-15), pages 13 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112860871A (en) * | 2021-03-17 | 2021-05-28 | 网易(杭州)网络有限公司 | Natural language understanding model training method, natural language understanding method and device |
CN112860871B (en) * | 2021-03-17 | 2022-06-14 | 网易(杭州)网络有限公司 | Natural language understanding model training method, natural language understanding method and device |
CN113762381A (en) * | 2021-09-07 | 2021-12-07 | 上海明略人工智能(集团)有限公司 | Emotion classification method, system, electronic device and medium |
CN113761872A (en) * | 2021-09-07 | 2021-12-07 | 上海明略人工智能(集团)有限公司 | Data detection method, system, electronic device and medium |
CN113762381B (en) * | 2021-09-07 | 2023-12-19 | 上海明略人工智能(集团)有限公司 | Emotion classification method, system, electronic equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112115267B (en) | Training method, device, equipment and storage medium of text classification model | |
CN111460820B (en) | Network space security domain named entity recognition method and device based on pre-training model BERT | |
CN112232070A (en) | Natural language processing model construction method, system, electronic device and storage medium | |
CN111191032B (en) | Corpus expansion method, corpus expansion device, computer equipment and storage medium | |
CN112507704B (en) | Multi-intention recognition method, device, equipment and storage medium | |
CN112417859A (en) | Intention recognition method, system, computer device and computer-readable storage medium | |
US20220300708A1 (en) | Method and device for presenting prompt information and storage medium | |
CN113722438A (en) | Sentence vector generation method and device based on sentence vector model and computer equipment | |
CN112417878A (en) | Entity relationship extraction method, system, electronic equipment and storage medium | |
CN111767714B (en) | Text smoothness determination method, device, equipment and medium | |
CN112183102A (en) | Named entity identification method based on attention mechanism and graph attention network | |
CN115759254A (en) | Question-answering method, system and medium based on knowledge-enhanced generative language model | |
CN113449081A (en) | Text feature extraction method and device, computer equipment and storage medium | |
CN112199954A (en) | Disease entity matching method and device based on voice semantics and computer equipment | |
CN112528653A (en) | Short text entity identification method and system | |
CN112035622A (en) | Integrated platform and method for natural language processing | |
CN116561320A (en) | Method, device, equipment and medium for classifying automobile comments | |
CN110705258A (en) | Text entity identification method and device | |
CN115859999A (en) | Intention recognition method and device, electronic equipment and storage medium | |
CN113255334A (en) | Method, system, electronic device and storage medium for calculating word vector | |
CN114116975A (en) | Multi-intention identification method and system | |
CN113536773A (en) | Commodity comment sentiment analysis method and system, electronic equipment and storage medium | |
CN112329445A (en) | Disorder code judging method, disorder code judging system, information extracting method and information extracting system | |
CN112149389A (en) | Resume information structured processing method and device, computer equipment and storage medium | |
CN110928987A (en) | Legal provision retrieval method based on neural network hybrid model and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |